What is a data mesh

Data mesh is a set of principles for designing modern data architecture, introduced by Zhamak Dehghani in 2019 (so it is still in an early phase of development). It is not a set of tools - it’s a platform-agnostic approach to managing analytical data and its flow in a company, in both technological and sociotechnical terms. Data mesh borrows heavily from concepts in software development, like domain-driven design (e.g. data modeling should be domain-specific) and the theory of team topologies (e.g. self-serve enablement). There are also many parallels between data mesh and microservices: both were born to address the problem of scale and complexity via decentralization. In the end, a lot of data mesh comes down to social techniques and data politics. Let's drill down into this topic, starting with the history of how it came to be.

Evolution of analytics data systems over recent years

Companies of all sizes have gone through several trends in their approach to data platforms. One can oversimplify and broadly sketch two eras in this timeframe: the data warehouse era and the later data lake era. (Zhamak Dehghani, the author of the data mesh approach, argues that we are currently in the third generation, which she describes as augmenting data lakes with real-time data.) Data lakes addressed some problems with data warehouses, including handling unstructured data and, more importantly in this context, agility. These eras keep evolving into new generations: the features that work best are polished, dead ends are abandoned and new approaches are tried. For example, the recent “lakehouse architecture” is a combination of techniques from data warehouses and data lakes.

As mentioned, the trend of software development transitioning from monolithic architecture to microservices can be seen here as well. That’s because monoliths, if not perfectly thought out, can be blockers to change - it’s hard to be agile with a great mass of coupled functions. The process of change may be so tedious that simple changes take weeks, and new features take months. The same can be true for the centralized data platform approach, especially in data warehouses. That approach relied on ETL processes, which in turn required a heavy modeling phase and removed some of the signal from the data. This prompted an organizational change in thinking about the responsibility for processing: from functionally divided to domain-oriented. The sheer complexity of software in organizations was answered by decentralization and by shifting responsibility closer to the teams that had the domain knowledge. Dehghani calls the current approach to data platforms monolithic in nature.

Status quo problems

Many early data management challenges centered on the technology itself—things like volume, velocity, variety, and veracity (“4 Vs”). Tools like Hadoop and cloud data lakes have addressed some of those technical dimensions.

However, other problems remain less technological and more political. These include coordinating an ever-growing diversity of data sources, use cases, and internal consumers, while also adapting quickly to changes in analysis needs. There are no one-size-fits-all solutions, as companies have unique environments and priorities. But the human and organizational aspects of data management require just as much attention as the technical ones.

Companies are often torn between two polarized approaches: either let each team manage their data as they see fit or create a centralized repository managed by a central team of data engineers.

Data warehouses relied very heavily on centralization and a central data team. Oftentimes, that team was tasked with preparing the data from different domains and putting together the data landscape of the company. That approach allowed for governance of the data according to universal company standards, a single source of truth and other benefits. It did, however, bring its own risks. It is prone to a lack of agility (e.g. slow integration of new sources). It also lacks the benefits of domain knowledge and data context – the central team may not know which sources are most trustworthy, which features are most important or what the correct way of processing them is. They rarely have time to find out, as they are burdened with bug fixes and new integration requests. The resulting architecture is orthogonal to the direction of change – the growth of ideas and needs is not reflected directly and immediately in the architecture itself.

On the other hand, the single-data-lake-many-admins approach can also be problematic. Without centralized governance and curation, data lakes easily turn into unfettered "data swamps" – disorganized pools of datasets in various states of documentation, transformation, and integrity. This makes it extremely difficult for other teams to discover and leverage data products that would advance their analytics.

The other, decentralized approach (at least without an overarching “company philosophy”) seems to be the hardest to implement correctly. Each domain produces and governs its own data. It makes sense because the data is processed by the people who know the most about it. However, it requires teams to have data infrastructure and engineering skills. In practice, this also means a lot of copying data from other domains to integrate it with the team's own data. This, in turn, creates “where-does-this-come-from-and-why-is-it-here” datasets that have lost their context, and a lack of knowledge about the policies and regulations that apply to the original data. Of course, there is also the dreaded “data silo” – datasets are hardly used outside of the team, so they waste their potential to enrich other teams' knowledge.

I don’t believe that everything before data mesh is bad: centralized architecture has its place and works great in some companies. The data mesh approach is not the only thing that can be implemented in a decentralized model. I will elaborate on the profile of the team that is required for the successful implementation of data mesh further in this article.

But what REALLY is a data mesh

After listing the challenges of the current data landscape, we can understand how data mesh was born. First, let’s reiterate: data mesh is a fresh concept and, as such, does not provide specific instructions for its implementation. It’s not “how to transform a denormalized data set into 3rd normal form”. However, it has four pillars that should inform how the architecture is modeled and used.

Pillar I: Domain ownership

The first two principles seem the most important. Each domain is responsible for its own data. Domain teams have the deepest understanding of that data – its quality, how well each dataset is aligned with reality, what is missing and how it was generated. Centralized teams, on the other hand, are “strangers” to the domain: they lack the understanding of the data that the domain has. When data is processed closest to the source, it is the freshest and can be presented in the right light, focusing on its most important aspects.

The speed of change is a factor here as well – domains don't have to wait for a central team to implement their ideas or to create the pipelines that populate the data warehouse. They can prioritize work according to their own needs, which according to Dehghani is quicker.

Pillar II: Data as a product

The data mesh approach advocates treating data like a product, where domains take ownership of the data they generate and provide it to internal data consumers. This shifts the mindset from data being a byproduct to being a primary output and asset that domains actively curate and share. They want their data to be easily accessible and discoverable. To meet this, their data product should adhere to standards set by the data mesh committee. Those standards include:

  • Addressability (the ease of use by consumers, self-describing features)
  • Trustworthiness (having SLOs and data lineage)
  • Interoperability (among others, by aligning products to company standards)
  • Security and governance (with the collaboration of central teams, according to global standards)

This in turn requires the infrastructure and help provided by the central data team.
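To make the four standards a bit more tangible, here is a minimal sketch of what a data product descriptor could look like. Everything in it is my assumption for illustration: the DataProduct fields, the mesh:// address scheme and the missing_standards check do not come from Dehghani's book or from any specific tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain publishes alongside its dataset."""
    address: str                  # addressability: a stable, unique location
    owner_domain: str             # the domain that answers for this product
    description: str              # addressability: self-describing documentation
    schema: dict                  # interoperability: column name -> type
    freshness_slo_hours: Optional[int] = None          # trustworthiness: promised max data age
    upstream: list = field(default_factory=list)       # trustworthiness: lineage
    access_roles: list = field(default_factory=list)   # security and governance


def missing_standards(product: DataProduct) -> list:
    """Return the names of the standards the descriptor does not yet satisfy."""
    problems = []
    if not product.address or not product.description:
        problems.append("addressability")
    if product.freshness_slo_hours is None:
        problems.append("trustworthiness")
    if not product.schema:
        problems.append("interoperability")
    if not product.access_roles:
        problems.append("security and governance")
    return problems


# Example values are made up for illustration.
orders = DataProduct(
    address="mesh://sales/orders_daily",
    owner_domain="sales",
    description="One row per order, refreshed nightly.",
    schema={"order_id": "string", "amount": "decimal"},
    freshness_slo_hours=24,
    upstream=["mesh://sales/raw_orders"],
)
print(missing_standards(orders))  # ['security and governance']
```

In a real mesh, a check like this would live in the shared platform rather than in each domain's code, so that every product is judged by the same rules.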

Other challenges brought by “data as a product” are staffing-related. To meet the expectations listed above, each team must have data-literate people (though not as specialized as the data engineers, analytics engineers, DevOps and the plethora of other roles needed to run centralized systems). A new position would be created: a data product owner, whose responsibility is to meet the expectations of the teams using their data and to care for the satisfaction of their clients. They must steer the data product, provide its roadmap and work on the quality of datasets.

Pillar III: Self-serve architecture

Dehghani points out the importance of the ability to navigate the data product market on your own. No middleman who connects the consumer with the producer is needed. The consumer team has to have an easy way to browse all the data they might need. A data catalog of all data products, with a search engine that provides normalized, detailed and up-to-date information about the datasets is needed (think “IMDB for datasets”). This stems from the idea that data is a product – the producer wants to sell it and consumers need it to provide better services. 

The need to have a framework for data mesh operations is similar for the data producer – there should be no need to ask external teams to set up the data sharing and updating. It has to be easy and automated as much as possible, providing features like data lineage out-of-the-box.
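The self-serve idea can be sketched as a toy catalog that producers register with and consumers search, with no middleman involved. This is only an in-memory illustration under my own assumptions: CatalogEntry, DataCatalog and the sample entries are hypothetical, and a real catalog would be a shared platform service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    domain: str
    description: str
    tags: tuple


class DataCatalog:
    """Toy in-memory catalog of data products."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        # Producers publish on their own, without a ticket to a central team.
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        # Case-insensitive match on name, description and tags.
        needle = term.lower()
        hits = [
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]
        return sorted(hits)


catalog = DataCatalog()
catalog.register(CatalogEntry("orders_daily", "sales", "Daily order totals", ("orders", "revenue")))
catalog.register(CatalogEntry("fuel_prices", "logistics", "Average fuel price paid per depot", ("cost",)))
catalog.register(CatalogEntry("churn_scores", "marketing", "Customer churn model output", ("customers",)))

print(catalog.search("fuel"))   # ['fuel_prices']
print(catalog.search("ORDER"))  # ['orders_daily']
```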

Pillar IV: Federated computational governance

I like Dehghani's description of federated computational governance: “A data mesh implementation requires a governance model that embraces decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology, and most importantly automated execution of decisions by the platform.”

Federated governance balances centralized platform tools and policies with domain governance of their data based on specialized expertise. Typically, there is a central data standards committee and local data stewards responsible for compliance in their domains. This allows for consistency, oversight and context-aware governance.

There are multiple interesting concepts in the federated computational governance model. First, it’s more of an ongoing process than a fixed set of policies – rules, categorizations and other properties evolve as the needs change. Second, it requires collaboration between central and local (domain) governance: global standards vs domain-specific policies. Third, the specification is not something that can be decided in advance. Rather, it’s something organic that evolves with time.

In summary, federated computational governance enables decentralized data ownership and flexibility in local data management while maintaining standards throughout the organization.
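The “computational” part means that rules are executed by the platform instead of being written down in a document nobody reads. Below is a minimal sketch of that idea, assuming policies are plain functions over a product's metadata. The policy names, the metadata keys and the split into GLOBAL_POLICIES and DOMAIN_POLICIES are my own illustration, not a prescribed design.

```python
from typing import Callable, Optional

# A policy inspects a product's metadata and returns an error message or None.
Policy = Callable[[dict], Optional[str]]


def require_owner(meta: dict) -> Optional[str]:
    return None if meta.get("owner") else "missing owner"


def require_pii_flag(meta: dict) -> Optional[str]:
    return None if "contains_pii" in meta else "missing contains_pii flag"


def sales_requires_currency(meta: dict) -> Optional[str]:
    return None if meta.get("currency") else "missing currency"


# Global standards come from the central committee (hypothetical examples).
GLOBAL_POLICIES = [require_owner, require_pii_flag]
# Domain-specific policies are owned by each domain's data steward.
DOMAIN_POLICIES = {"sales": [sales_requires_currency]}


def validate(meta: dict) -> list:
    """Run global policies first, then those of the product's own domain."""
    policies = GLOBAL_POLICIES + DOMAIN_POLICIES.get(meta.get("domain", ""), [])
    results = (policy(meta) for policy in policies)
    return [message for message in results if message is not None]


product = {"domain": "sales", "owner": "sales-data-team", "contains_pii": False}
print(validate(product))  # ['missing currency']
```

Because both lists are just data, the rules can evolve over time without changing the validate function, which fits the “ongoing process” nature of this pillar.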

Data mesh as a remedy to status quo problems

The problems of the current data landscape, as defined by Dehghani, are also addressed by the ideas that form the data mesh paradigm. Let’s go through the blocker-enabler pairs that she proposes.

  • Monolithic and centralized architecture from which many of the problems stem - replaced with one that is aligned to the business domain, both in terms of the data and technology.
  • Loss of the business context - domains provide access to both operational and analytical data.
  • Cross-team data pipeline operations burden, errors - remove the ETL/ELT hand-off from producers to central data engineering teams; the code to transform and load data resides together with the raw data in the domain's space.
  • Data ownership problems - clear ownership (domains).
  • Lots of data engineers needed - generalists can use it.
  • Problematic governance - governance embedded into the data platform.

Data mesh guidelines and implementation

On one hand, Dehghani says “I have intended for the four principles to be collectively necessary and sufficient”, but on the other, most data architecture experts say that they have not yet seen a data mesh implemented as per Dehghani's complete vision. Some implementations follow the spirit of the four pillars, but some of the freedom promised by the data mesh is always compromised. There are also no dedicated “data mesh” tools or platforms (although Dehghani founded NextData https://www.nextdata.com/).

James Serra in this presentation lists three architectures that “approach” data mesh (ordered by decentralization level):

  1. Same technology used across domains, central storage used by all domains.
  2. Same technology used across domains, but each domain uses its own instance of the common storage technology (e.g. its own S3 buckets with Parquet files).
  3. Domains use any technology and storage type they want.

The third one is the “truest” to the book, but with current technology, it is also the hardest to implement properly. What is left is some compromise in the name of performance and administration, a “partial data mesh”. However, these seem sufficient for some big players (as listed in further sections).

The role of the domain data team is to provide quality products: well described, up-to-date and aggregated to consumer needs. The “offer” of the publisher should include:

  • Clean, timely and well-described data, for new data and historical snapshots (perhaps aggregated).
  • Metadata – documentation, glossary, semantic and syntactic declarations (what’s in the data and how to read it), perhaps a changelog.
  • An SLA that includes the terms and quality measures (like timeliness and error rates).
  • Code related to the ingestion (or production) of the original content, its transformation and the API that exposes it.
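The SLA item in the list above is easy to automate. Here is a small sketch of a check for the two measures mentioned there, timeliness and error rate. The Sla class, the thresholds and the sample numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Sla:
    max_age: timedelta      # timeliness: how old the newest data may be
    max_error_rate: float   # quality: share of rejected rows that is tolerated


def sla_breaches(sla: Sla, last_updated: datetime, rows_total: int,
                 rows_failed: int, now: datetime) -> list:
    """Return the names of the SLA terms that are currently violated."""
    breaches = []
    if now - last_updated > sla.max_age:
        breaches.append("timeliness")
    error_rate = rows_failed / rows_total if rows_total else 0.0
    if error_rate > sla.max_error_rate:
        breaches.append("error rate")
    return breaches


sla = Sla(max_age=timedelta(hours=24), max_error_rate=0.01)
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)

# Data is 30 hours old (too stale), but only 0.5% of rows failed (acceptable).
print(sla_breaches(sla, last_load, rows_total=10_000, rows_failed=50, now=now))  # ['timeliness']
```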

To accomplish serving multiple clients, data duplication might be needed.

Keys for success

I don’t think that most companies will find that full-blown data mesh is the right approach for them. There, I said it. It requires specific conditions, huge investment and a trailblazing approach to technology.

Perhaps the most important thing to remember is that it is mostly a sociotechnical approach, not a technology. If your company won’t be on board with decentralization and treating data as each domain’s product, you will not be able to reap the benefits. Teams need to think about themselves as servants to an organization (“anti-data chauvinism”).

If you want to implement data mesh in your business, first list all the things that are wrong with your current system. Do they align with the “blocker-enabler” list by Dehghani? While not universally applicable, this list can spark useful self-reflection on alignment with data mesh principles within your systems and team structures.

If the lists align, check if you score at least “medium” in all categories mentioned in her book “Data Mesh: Delivering Data-Driven Value at Scale”:

  • Organizational complexity
  • Data-oriented strategy
  • Executive support
  • Tech at core
  • Early adopter
  • Modern engineering
  • Domain-oriented organization
  • Long-term commitment

You can read a bit more about these categories in this blog post. 
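If you want to turn the checklist above into a quick self-assessment, a sketch like the one below is enough. The category names are the ones listed above; the three-step low/medium/high scale and the rule that an unscored category counts as “low” are my assumptions.

```python
# Assumed three-step scale; only the "at least medium" threshold comes from the text.
LEVELS = {"low": 0, "medium": 1, "high": 2}

CATEGORIES = [
    "Organizational complexity",
    "Data-oriented strategy",
    "Executive support",
    "Tech at core",
    "Early adopter",
    "Modern engineering",
    "Domain-oriented organization",
    "Long-term commitment",
]


def weak_categories(scores: dict) -> list:
    """Return the categories scored below "medium"; unscored ones count as "low"."""
    return [c for c in CATEGORIES if LEVELS[scores.get(c, "low")] < LEVELS["medium"]]


# Made-up scores for a hypothetical company.
scores = {category: "medium" for category in CATEGORIES}
scores["Executive support"] = "low"
scores["Tech at core"] = "high"

print(weak_categories(scores))  # ['Executive support']
```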

If you’re still interested in data mesh in your organization, consider the financial investment and time needed to make it happen.

This is a fundamental but achievable change: as mentioned later, there are great companies that implemented it and feel that this was the right move. However, you should be aware of the challenges it may bring during implementation and usage.

Human component

Another very important point in the data mesh discussion is navigating the change in people’s approach and the impact of organizational change on their work. Many aspects of adopting data mesh can provoke a team's resistance to change: transitioning to a decentralized paradigm will affect skill requirements, work patterns, budgets and power dynamics in the workplace. At the same time, changing parts of the codebase is inevitable and may be hard to fit into the budget while keeping the legacy systems online.

One must also take into account that the data mesh approach requires creating new teams, like a “steering committee” for the data mesh. The steering committee’s decisions regarding standards may be frowned upon by domain teams that have spent large amounts of time and used their best knowledge creating their own. For many organizations, the challenge of hiring data engineers for each domain would also be an important consideration.

As a result, leadership must anticipate and mitigate the churn stemming from people, not just technology.

Technology ecosystem

Because of the freshness of the idea, tools that are essential to data mesh do not exist yet, are adapted from other areas, or are being built by the very teams that use them. For the data mesh to “spread its wings”, multiple systems have to work together. The ecosystem is especially lacking for the third and fourth principles (self-serve architecture and federated computational governance). Essentially, what is needed is an infrastructure where a domain can plug in its data system, adapt its products and get an evaluation of their fit for this system. This infrastructure should implement a “data marketplace”, where data products can be found and their fit for the consumer's use can be evaluated. A centrally managed governance system is needed to fulfill the requirements of the fourth pillar of data mesh.

Of course, the cloud providers try to follow market trends, and products like AWS DataZone, Google Cloud’s Dataplex and Azure Purview are being released. These are usually a layer on top of other existing technologies, and they don’t allow the full realization of the data mesh idea. One can also think of repurposing existing tools for data mesh functions, for example:

  • Metadata catalog (e.g. DataHub, Collibra)
  • Data lake (e.g. Glue + S3 for AWS) or data warehouse (e.g. Snowflake)
  • Data virtualization (e.g. Dremio) or data transfers (e.g. Meltano) or common access streaming (e.g. Kafka)

That would still require experimentation and writing additional, in-house software that would allow this to work.

Critique and concerns

There are quite a lot of concerns related to data mesh.

Let’s talk about data mesh maturity. Cutting to the chase: it’s not mature. This means that definitions are flexible, boundaries are unknown and little guidance is provided. Some experts believe that we need at least 5 more years for it to mature. Some (like the Gartner report from 2022 linked below, repeated in 2023 - the latter is not publicly available) say that this is hype and won’t reach widespread adoption. This also means that there are different views on what a data mesh consists of, from whether transactional (operational) data is included or not, to Confluent’s Data Mesh tutorial, where everything is converted to a stream on the same platform.

And coming again to the toolkit: companies need to develop some software tools on their own.

Other challenges come from the fact that data mesh is mostly about the company data culture. There needs to exist a prioritization of thinking in terms of data products. There should be very little help from outside – the team is responsible for providing the best data they can. Following that is the need to have “data literate” people on that team and someone like a data engineer who will know how to structure and publish the data. 

From a team perspective, it is also worth noting that “data as a product” may mean that the producer has to create multiple versions of datasets (so that other teams will find the view of the data they need). One of the most important things I keep thinking about is the incentive for teams to work hard on data publishing when their prime goal may be something different. The company policy should clearly express what is expected of them in terms of data. This all means great investments of time and money, and steering long-term change.

Let’s remember that data mesh is, by design, decentralized – this introduces both advantages (as discussed before) and potential disadvantages. Data analysts may have problems understanding the relationships between the datasets or knowing when to use which one – knowledge that is often gained through years of interacting with the entire data ecosystem. Some believe that a “data librarian” (a person or team keeping track of what data exists in the company) is necessary. If a domain has to release a product, it may do so with the best intentions, but because of a lack of “contracts” between the producer and consumers, it may fail to deliver the data in the form that is optimal for all of them. I think the assumption is that the “invisible hand of the (data) market” will organically make it work.

Decentralization also means addressing the problem of having a common business glossary and ID mappings. There needs to be a way to be sure that the company speaks a common language and has a way of identifying entities accurately across teams.
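One common way to tackle the ID mapping part is a shared crosswalk table that translates each domain's local identifiers into a company-wide one. The sketch below shows the idea with a plain dictionary; the domains, the IDs and the CROSSWALK structure are invented for illustration, and in practice this is the kind of problem MDM tools address.

```python
# (domain, local id) -> company-wide entity id; all values are made up.
CROSSWALK = {
    ("sales", "C-1001"): "cust-42",
    ("support", "8842"): "cust-42",
    ("sales", "C-1002"): "cust-43",
}


def to_global_id(domain: str, local_id: str) -> str:
    """Translate a domain-local identifier into the shared one."""
    try:
        return CROSSWALK[(domain, local_id)]
    except KeyError:
        raise LookupError(f"no mapping for {domain}:{local_id}") from None


def same_entity(a: tuple, b: tuple) -> bool:
    """Check whether two (domain, local id) pairs point at the same entity."""
    return to_global_id(*a) == to_global_id(*b)


print(same_entity(("sales", "C-1001"), ("support", "8842")))  # True
print(same_entity(("sales", "C-1002"), ("support", "8842")))  # False
```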

Coming to performance, we hit another challenge: unless all data is kept on the same platform in a unified, indexed way, performance will be hindered or a lot of data duplication will be required. The data virtualization that would be required in this area is still quite a new topic, and having data on, e.g., different clouds will always mean slower queries.

Companies that implemented data mesh

Some important big companies implemented data mesh architecture. We should take it with a grain of salt, as such articles almost always only point out the good stuff that has happened – it’s hard for a manager to admit: “I screwed up, and the money that went into that investment was mostly wasted”.

Unfortunately, not much is known about smaller companies that implemented it. I think this is due mostly to two facts. The first one is the scale: within smaller companies, it’s often easier to wrap your head around the data that exists. This makes a centralized data repository easier to maintain and develop. The second one is that there is simply not that much of an audience for the stories of smaller companies – only the stories of the big ones are highlighted. Therefore, while data mesh is being advertised as a solution that will also work in smaller companies, there is not that much information about it.

Final thoughts

I think it is great that we’re exploring new ideas and architectures (one can even say philosophies) in the world of data. We are well on our way to making it easier for less technical people to create data tools. It shows the progress from the era of people having to ask data engineers “what is the average price we pay for fuel” to “provide me with tools with which I can share my analysis”.

Related to the data mesh idea is the evaluation of the trustworthiness of the data – teams are expected to provide SLAs for their data. With that, one does not need (or at least “should not need”) to call each provider of the data to hear “let me ask my teammate when he updated this dataset; he says he does not remember, perhaps last month”. The goal is for the consumer to trust the data source so that they can make informed business decisions.

I (from my nerdy point of view) love that this creates a world of “data care”: domain owners working on the quality, description and usefulness of their data, and other teams able to add information that enhances both the data (for example with data from their own department) and the metadata (reviewing the SLAs of the publishing team, showing their view on what the data can be used for, etc.). Also, the data can be automatically ranked by the count of downstream users (or revenue brought), and certificates can be created for data meeting the highest standards – the possibilities are huge.

Pinpointing the responsibility for the data errors is also a great thing and should allow for quick and global problem resolution.

However, unless this strikes a chord with your view of your company and you meet all the points mentioned in “Keys for success”, I would wait some time. The critique seems valid, and the whole idea and the tooling around it are a “work in progress”. I believe the pillars all have their place in different companies – I’m not sure if one needs to buy the whole package at once.

To answer the cliffhanger from the subtitle: I think that the real question should be: does data mesh fit your company’s culture and roadmap?

Appendix: trending (and confusing) concepts in the data world

Data virtualization – an approach to abstracting data sources that allows querying datasets without copying the data, e.g. to another cloud provider. May support the data mesh by decoupling data consumers from the actual storage.

Data lakehouse – an architectural pattern that combines data lakes and data warehouses. Allows for using low-cost storage and having the benefits of both structured and unstructured data.

Data fabric – a framework for connecting heterogeneous data, potentially across multiple platforms. May be helpful in data mesh to link domains and facilitate data sharing.

MDM (Master Data Management) – common name for tools and practices for managing “master data” (a single source of truth for most important entities in a company). May be useful in providing standards for that kind of data in data mesh.

Related URLs