On Data Centralization Disease and Data Fabrics

Why do so many chase the dream of a single centralized data store? It’s neither practical nor desirable. And there’s a better way: data fabrics.

"Our executives decided to centralize our data in a single data store." At a glance, centralization edicts can seem like a good idea: one data store is better than many, right?

Wrong. Data diversity is a better idea. We know this from the real world: diverse opinions forge better societies, consumers crave choice, and nobody wants the same team to win the Super Bowl every year (even Patriots fans :)). 

No, data monoliths are neither practical nor desirable. And there's a better way.

But let’s back up: why does data diversity matter? And why is there so damn much data in the first place? 

The Never Get Enough Principle

As MIT economist David Autor explains, we humans can never get enough. That is, humanity has an insatiable desire to invent and create. As soon as we improve one area of our lives, we develop new toys and tools. We can never get enough. 

Autor uses the Never Get Enough principle to explain why new jobs appear while old ones are automated away. For example, since the automated teller machine’s introduction 45 years ago, the number of human bank tellers in the United States has doubled from 250,000 to 500,000. In the face of overwhelming automation, why did teller jobs increase? Contrary to popular belief, automation magnifies the importance of human expertise, judgment, and creativity. With cash dispensing delegated to ATMs, banks invented ways to put tellers to good use. Human-centric customer service, and the loyalty it creates, is now a fundamental business success factor.

All these new jobs create new data, which motivates a principle derived from Autor’s: the Always-Have-Too-Much principle of data.

The Always-Have-Too-Much Data Principle

Thanks to the Never Get Enough principle, new data arrives like waves of stampeding rhinos. In just 50 years, the data management market grew from nothing to $189 billion. And the waves keep coming: mainframes, spreadsheets, relational databases, NoSQL, streaming, cloud, data science stores, graph...

The technologies required to manage each wave of data do not supersede each other; they accumulate like a snowball rolling downhill.

So when it comes to data, we always have too much, because humanity can never get enough. And the tools required to manage it build up over time.

Healthier Data Management Habits: Data Fabric 101

So what do we do with all this data? Centralization is neutralization: it saps data of its value. And it’s a fool’s errand, because the Always-Have-Too-Much principle says we’ll keep creating more and never catch up.

The data fabric, an increasingly popular architecture promoted by analysts at Gartner, offers a best-of-both-worlds approach to data management: data diversity and ubiquitous data access at the same time.

Two elemental strands in a data fabric are Metadata Management and Data Virtualization. These technologies leave data in its original form and weave it together dynamically when needed. Authentication, security, and control are maintained at the fabric level while data stays in its most valuable, raw form. 
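To make the virtualization idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration (the in-memory SQLite customer table, the JSON-style order records, and the virtual_customer_spend view); it is a toy, not any vendor's actual fabric API.

```python
# Minimal illustration of data virtualization: data stays in its source
# systems and is woven together only when a query needs it.
# All sources and names below are hypothetical, for illustration only.

import sqlite3

# Source 1: a relational store (stand-in for an operational database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
               [(1, "Ada", "Boston"), (2, "Wang", "Austin")])

# Source 2: a document-style source (stand-in for an API or NoSQL store).
orders = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 2, "total": 80.0},
    {"customer_id": 1, "total": 45.5},
]

def virtual_customer_spend():
    """Join the two sources on the fly; nothing is copied into a central store."""
    spend = {}
    for order in orders:
        cid = order["customer_id"]
        spend[cid] = spend.get(cid, 0.0) + order["total"]
    for cid, name, city in db.execute("SELECT id, name, city FROM customers"):
        yield {"name": name, "city": city, "total_spend": spend.get(cid, 0.0)}

for row in virtual_customer_spend():
    print(row)
```

A real virtualization layer adds query pushdown, caching, and the fabric-level authentication and access control described above, but the principle is the same: the join happens at read time, not at copy time.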

Five threads form a complete data fabric:

  1. Data virtualization converts data from its original format into the neutral, queryable forms (e.g., SQL) required for jobs like reporting. Like a duck gliding across a pond, data access looks effortless while virtualization paddles hard below the surface.

  2. Metadata management / MDM is data about data. It's the DNA of the enterprise that describes the fabric.

  3. Data quality. There are 75 million people named Wang in the world. Millions of people in the United States live on a "Main Street" (there are 7,000 of them). Data quality tools help companies correct, and therefore trust, the data in the fabric (see the first sketch after this list).

  4. Data catalogs are like Google for enterprise data. They provide self-service access to find and transform data from the fabric.

  5. Data science ModelOps, or model operationalization tools, are like Uber Eats for data science algorithms. They deliver algorithms (the food) from the data science lab (the kitchen) into production (to the customer: business users). Operationalized models generate predictions on the fly, and those predictions belong in your data fabric (see the second sketch after this list).
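To make thread 3 concrete, here is a tiny sketch of the kind of standardization a data quality tool performs. The substitution table and function name are invented for illustration; real tools use reference data, fuzzy matching, and survivorship rules.

```python
# Toy illustration of data quality standardization (thread 3).
# The hand-written synonym table below is only a stand-in for the
# reference data a real data quality tool would use.

STREET_SYNONYMS = {"st": "street", "st.": "street", "ave": "avenue", "ave.": "avenue"}

def standardize_address(raw: str) -> str:
    """Normalize casing and expand common street abbreviations."""
    words = raw.strip().lower().split()
    words = [STREET_SYNONYMS.get(w, w) for w in words]
    return " ".join(words).title()

# Two spellings of the same address now match, so they can be trusted
# (and deduplicated) inside the fabric.
assert standardize_address("123 MAIN ST.") == standardize_address("123 Main Street")
```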
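And to make thread 5 concrete, here is a minimal sketch of operationalizing a model so the fabric can ask it for predictions on the fly. The registry, the "churn_risk" model, and its trivial rule are hypothetical placeholders, not a real ModelOps product.

```python
# Toy illustration of ModelOps (thread 5): a trained model is registered
# behind a stable name so the fabric can request predictions at query time.
# The "model" here is a trivial rule; a real one would come from the lab.

from typing import Callable, Dict

MODEL_REGISTRY: Dict[str, Callable[[dict], float]] = {}

def register_model(name: str, predict_fn: Callable[[dict], float]) -> None:
    """Promote a model from the data science lab into production."""
    MODEL_REGISTRY[name] = predict_fn

def predict(name: str, features: dict) -> float:
    """Called by the fabric on the fly, like any other virtual column."""
    return MODEL_REGISTRY[name](features)

# Hypothetical churn model: flags customers who have not logged in recently.
register_model("churn_risk",
               lambda f: 0.9 if f.get("days_since_login", 0) > 60 else 0.1)

print(predict("churn_risk", {"days_since_login": 75}))  # prints 0.9
```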

Be careful. The term “Data Fabric” is often distorted by software vendors with selfish intent to own your data (1). If your CRM or cloud provider promotes a data "fabric," do your research. Read the Gartner reports. Fabrics are an emerging technology, and innovation is fast and furious.

Our Future Is a Data Tapestry

Thanks to human ingenuity, the Never Get Enough principle will reign forever and create more and more valuable data exhaust. Data fabrics weave that exhaust into an agile, robust, flexible tapestry that turns data into value, innovation, and insight.


Prefer to watch this blog post in video form? Here it is, an animated and narrated 5-minute argument for data fabrics as the antidote to data centralization disease. Enjoy!

FOOTNOTES and BIAS DISCLOSURE

(1) Full disclosure: I am the head of products for a company that provides data fabric tools but does not have a vested interest in “owning” your data. My company was founded to be “Switzerland” for enterprise software assets. That said, we do have a selfish interest in selling data fabric tools, and this post does not argue that our approach is superior to our competitors’, for whom we have great respect.
