I was able to catch up with Chad Sanderson, Head of Data at Convoy, recently and ask him deeper questions about his idea for the “Semantic Data Warehouse.” His recent post regarding this topic has stirred up quite a conversation, which is not a surprise. Chad clearly explains the problems data professionals face and often couples that with out-of-the-box thinking.
In this first part of our discussion, we focus on semantics, how it should be defined, and how the Semantic Data Warehouse fits with the modern data stack. The TLDR for me is:
- Analysts need a better way to see and agree upon ground truths in the business before doing analytics.
- Semantics is the level of abstraction we should be considering for data, and we should implement that in warehouse schemas.
- The modern data stack is doing a great job with procurement. Now we need to focus on helping analysts better understand the data they are getting.
I’m interested in what y’all think so feel free to reach out before you start reading part two. Without further ado, here is my conversation with Chad.
Mark: You’re tackling two big topics with the Semantic Data Warehouse idea. You’re saying most teams are using the data warehouse incorrectly, and you are arguing that popular ideas for the Semantic Layer are headed in the wrong direction. I would like to start with “semantics.” Is this a simple matter of preference or is there a compelling reason the semantic layer should be in the warehouse?
Chad: Absolutely, people pushing for semantics outside of the warehouse are using the wrong terminology. “Semantics” refers to the underlying meaning of a thing. When we are building data platforms for analytics, the “things” we should be focused on are the business-critical entities that reside in the warehouse. When you read up on the history of the data warehouse, a major theme is that it is supposed to reflect the state of the world we are analyzing. The semantic structure needs to be there in the warehouse in order to properly reflect the world.
Once you have those entities semantically defined, you can then derive things that are meaningful to the business, like “margin” for example. All of these logical derivations and metrics are what we create downstream based on those real-world entities. What some people refer to as “semantics” in the modern data stack is more like a metric layer or a logical layer.
If you don’t have a standard semantic representation for the objects underneath that derived information, you are setting yourself up for a problem. Instead of getting trusted objects that can be treated as a source of truth to build insights on, the analyst has to figure things out as best they can. They’re writing an enormous amount of bespoke logic to create metrics that get adopted by the business. What is worse, the context for that logic never leaves the analyst’s brain. That person leaves and you are left with a thousand-line model with references nobody fully understands.
By contrast, at Convoy every analyst creating metrics should understand exactly what the shipments table refers to. Those entities are not something for people to re-discover and reinvent on their own. There are plenty of other problems that require their brainpower that can build off of the established context for the shipments table.
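To make the distinction between a semantically defined entity and a derived metric concrete, here is a minimal Python sketch. It is illustrative only and not Convoy's actual model: the `Shipment` fields (`shipper_price`, `carrier_cost`) and the margin formula are assumptions chosen for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Shipment:
    """Hypothetical semantic entity: one record per real-world shipment."""
    shipment_id: str
    shipper_price: Decimal  # what the shipper pays (assumed field)
    carrier_cost: Decimal   # what the carrier is paid (assumed field)


def margin(shipment: Shipment) -> Decimal:
    """Derived metric: fraction of the price kept after paying the carrier."""
    if shipment.shipper_price == 0:
        raise ValueError("margin is undefined for a zero-price shipment")
    return (shipment.shipper_price - shipment.carrier_cost) / shipment.shipper_price


example = Shipment("S-1", Decimal("1000"), Decimal("850"))
print(margin(example))
```

The entity is the agreed ground truth; `margin` is downstream logic that can change without anyone having to rediscover what a shipment is.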
Mark: As an analytics practice, using semantics to enforce ground truth in the warehouse would eliminate confusion that many teams face. But for the design of the warehouse itself, you say that you incorporate many of Inmon’s principles. Have you rejected anything or is there part of your approach that is at odds with traditional methods?
Chad: I wouldn’t say I have rejected anything outright. Many proponents of the traditional warehouse think architecture should be “siloed.” By that I mean only a few people should do architecture for everything. I do not think that is the right way. I think that work has to be federated. If you have a process where you iteratively add on the pieces of the business that you understand, then you can build out this semantic layer in a way that is agile.
“Data-as-code is the wrong level of abstraction for data teams, while semantics is the right level of abstraction.”
This process ties right back to the importance of semantics in the warehouse. I think the reason that something like GitHub has not taken off for data is that people are over-indulging in analogies from engineering. “Data-as-code” is the wrong level of abstraction for data teams, while semantics is the right level of abstraction. And so iterative addition to the semantic model will allow the right people to review and make changes that will benefit analytics and therefore benefit the business. This is a very different concept from legacy data warehousing. I don’t claim to be the first person to identify this but, if anybody has built this, I haven’t seen it.
Mark: If data-as-code is the wrong direction, what do you think of the modern data stack (MDS) as a whole? Is it helping or hurting analysts?
Chad: My critique of the modern data stack is that it is focused almost exclusively on data procurement. Let’s contrast this with how we build a complex, real-world object like a house. A house goes through several specific steps, each with different stakeholders and different requirements. In the first step, you pick a location and have a rough idea of the materials needed. In the next step an architect creates a floor plan which helps the developer better understand the requirements. And then you render the floor plan so you know what it should look like after the fact. From those activities you have all the context to figure out what materials are needed and what actually needs to be built. And then it flows to procurement, who figures out how to get quality materials on schedule. Then there’s development and implementation, where the house is actually built.
If we were building that house with the MDS, we would quickly grab a bunch of raw materials, dump them at the site, and then hope the construction team would know what to build. In this metaphor it’s obvious there will be chaos, but when it comes to data, for some reason it’s not obvious or at least not as outrageous. But if you speak to analysts, you’ll know it is pretty bad.
“The primary theme of the Semantic Warehouse is that if you are purposeful in your design and you allow your downstream tooling to inherit requirements, then you can actually build a scalable, maintainable warehouse.”
I’ve seen a specific example of this at Convoy with the pricing team, the backbone of the company. They had a machine learning model that predicted how we should price our shipments in our marketplace. But, at the time, they were dropping a non-trivial amount of data from their training sets. When they tried to solve this problem they kept running into a lack of ownership. One important data set for training was unbelievably wide. It had existed for about five years and people had been incrementally adding columns to fit their use case. It was breaking constantly because nobody was maintaining quality. When we tried to fix it we realized Production was the de facto owner, but they didn’t see it that way. So here we were, relying on a massive orphan of a data set. Back to the house metaphor, not only were the materials just dumped at the build site, we had trouble getting the supplier to help us sort through what we had.
A big part of this problem, in my experience, is that Production has no idea what is going on in the warehouse. They are totally abstracted away from it. Asking for help from an engineer would be completely different if they knew their database was feeding features for the pricing team and that a change to a column would break the company’s pricing model. They would know that it affects the business and therefore affects them. Then they would act accountable. I want to make that insight automated and make it more trivial for engineers to fulfill data requests. The MDS is not addressing this problem at all, so as practitioners we need to withhold some of the attention that is getting spent on procurement and divert it to planning.
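As a rough illustration of the visibility described here, the following Python sketch walks a column-level lineage graph to list everything downstream of a production column. The graph, the column names, and the `downstream_of` helper are all hypothetical; real lineage would come from whatever catalog or metadata tooling a team uses.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
LINEAGE = {
    "prod.shipments.pickup_time": ["warehouse.shipments.pickup_time"],
    "warehouse.shipments.pickup_time": ["features.pricing.hours_to_pickup"],
    "features.pricing.hours_to_pickup": ["ml.pricing_model"],
    "prod.shipments.notes": [],
}


def downstream_of(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every asset reachable from `column`, in breadth-first order."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(lineage.get(column, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


impacted = downstream_of("prod.shipments.pickup_time", LINEAGE)
print(impacted)
```

Surfacing a list like this in a pull request would show an engineer that changing `pickup_time` reaches the pricing model, while changing `notes` reaches nothing.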
The primary theme of the Semantic Warehouse is that if you are purposeful in your design and you allow your downstream tooling to inherit requirements, then you can actually build a scalable, maintainable warehouse. If you plan requirements and owners ahead of time they will be there when you need them. This is the core message I’m trying to get across. You can use the tools that already exist in the modern data stack in a way that avoids a horrific experience for the data team.
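One way to picture downstream tooling inheriting requirements and owners is the small Python sketch below. It is an assumption-laden illustration rather than a description of the Semantic Warehouse itself: the `Column` and `Entity` classes, the team names, and the `select` method are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    owner: str             # team accountable for this column (assumed convention)
    nullable: bool = True  # a simple stand-in for a data requirement


@dataclass(frozen=True)
class Entity:
    name: str
    columns: tuple[Column, ...]

    def select(self, view_name: str, *names: str) -> "Entity":
        """Build a downstream view whose columns keep their owner and requirements."""
        by_name = {c.name: c for c in self.columns}
        missing = [n for n in names if n not in by_name]
        if missing:
            raise KeyError(f"{self.name} has no columns {missing}")
        return Entity(view_name, tuple(by_name[n] for n in names))


shipments = Entity("shipments", (
    Column("shipment_id", owner="shipments-eng", nullable=False),
    Column("carrier_cost", owner="payments-eng", nullable=False),
    Column("notes", owner="shipments-eng"),
))

pricing_features = shipments.select("pricing_features", "shipment_id", "carrier_cost")
for col in pricing_features.columns:
    print(col.name, col.owner, col.nullable)
```

Because the owner and the nullability rule travel with each column, the downstream view never becomes an orphan: anyone using `pricing_features` can see who to ask about `carrier_cost`.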
In part two, Chad and I will discuss the roles played by contracts and the data catalog in the Semantic Data Warehouse.