Earlier this week, dbt Labs CEO Tristan Handy wrote a post making the case for new standards to solve the metadata problem. My goal here is to continue the conversation. In Tristan's words:
"In a sufficiently complex organization, it is not good enough to find a table called customers in your warehouse—you need to know how it was produced, who built it, when it was updated, etc. in order to make use of it."
I could not agree more with this premise. The Modern Data Stack has led to two phenomena:
a) Reducing the barrier to entry for producing and consuming data
b) Creating best-of-breed products
These, in turn, have led to a lot more data but also a lot more chaos. More chaos means more questions like the ones below, referred to as the ABC of metadata [1]:
Application Context - Where is the data? What are the semantics of the data?
Behavior - Who is using the data? Who created it?
Change - How has this data evolved over time?
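The three categories above can be sketched as fields on a metadata record. This is a hypothetical illustration of the taxonomy, not the schema of any real product; all names are made up:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Application context: where the data lives and what it means
    name: str
    warehouse_location: str
    column_descriptions: dict = field(default_factory=dict)

    # Behavior: who produces and consumes the data
    created_by: str = ""
    frequent_users: list = field(default_factory=list)

    # Change: how the data has evolved over time
    last_updated: str = ""
    schema_versions: list = field(default_factory=list)

# A table called "customers" is not useful on its own; the
# surrounding context is what makes it usable.
customers = TableMetadata(
    name="customers",
    warehouse_location="analytics.prod.customers",
    created_by="data-eng",
    last_updated="2022-08-15",
)
```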
Tristan then describes the real barrier to solving this problem, which I also agree with:
"the hard problem is: how do you get an entire ecosystem of vendors to build products that answer these questions?"
Now, the point that I disagree with is the following:
"It will either be solved by product integrations in an open ecosystem (requiring standards) or via commercial consolidation."
There is a third way - the integration service.
Let’s take an example. The average organization today has dozens of SaaS applications [2]. These applications need to send data to the data warehouse, which we use to understand, run, and evolve our businesses. However, these SaaS services are neither consolidated nor built on a widely adopted standard. This is true for Salesforce, Marketo, and whatever other applications commonly send data to the warehouse. Data integration services like Fivetran and Airbyte offer hundreds of connectors that extract and load your data from disparate SaaS services into the warehouse. Not only that, there are now vendors that reverse-ETL your data from the warehouse back into these same SaaS applications.
This activity is not just happening between SaaS applications and your warehouse. For example, Segment, a customer data SaaS, instruments and ingests data from disparate web and mobile sources.
So, the interesting question is what circumstances lead to commercial consolidation vs. open standards vs. integration services?
There are at least two factors:
- Maturity: less mature products tend to focus more on their core product use-cases and less on integrations
- Purchasing patterns: Do users involved in making a purchase decision value adherence to a standard, or do they just care that integration is handled and effective?
Bringing these back to the metadata problem:
- Maturity: The modern data stack is less mature than the major SaaS products I mentioned earlier. This means that, more often than not, vendors will focus on improving and scaling their core use-cases in the near term.
- Purchasing patterns: More often than not, users want effective integration but do not care how it is done.
I have a biased view [3], but my prediction is that the metadata problem will be solved by a pattern similar to the one we followed for the data problem. Just as Fivetran and Airbyte act as data integration services, there will be metadata services that ingest, aggregate, and make recommendations based on metadata from across multiple products, enhancing the overall experience for data users.
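To make the idea concrete, here is a minimal sketch of what such a metadata service might do: pull partial metadata from several tools and merge it into one record per table. The connectors, field names, and values are all hypothetical, invented purely for illustration:

```python
# Each "connector" stands in for a real integration with one tool.
# The data returned here is fabricated for the sketch.

def warehouse_connector():
    # e.g. table freshness pulled from warehouse system tables
    return [{"table": "customers", "last_updated": "2022-08-15"}]

def transform_tool_connector():
    # e.g. ownership pulled from a transformation tool's artifacts
    return [{"table": "customers", "owner": "data-eng"}]

def bi_tool_connector():
    # e.g. usage statistics pulled from a BI tool's query logs
    return [{"table": "customers", "weekly_queries": 412}]

def aggregate_metadata(connectors):
    """Merge per-tool metadata records, keyed by table name."""
    catalog = {}
    for connector in connectors:
        for record in connector():
            entry = catalog.setdefault(record["table"], {})
            entry.update(record)
    return catalog

catalog = aggregate_metadata(
    [warehouse_connector, transform_tool_connector, bi_tool_connector]
)
```

After aggregation, `catalog["customers"]` combines context, behavior, and change metadata that no single tool held on its own, which is the integration-service value proposition in miniature.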
Only time will tell who is right. If you think there are other important factors besides the two I mention, I would like to hear them. What do you think?
[1]: Terminology from Ground: A Data Context Service
[2]: Statista, Average number of software as a service (SaaS) applications used by organizations worldwide from 2015 to 2021, accessed August 16, 2022
[3]: I am the co-founder of Stemma, which is a metadata integration service (aka data catalog). Prior to that, I co-created Amundsen, which did just that at Lyft.