August 30, 2022

The Semantic Data Warehouse Part 2: Entities, Contracts, and Role of the Data Catalog

Mark Grover
Co-founder, CEO of Stemma

In this follow-up to part one of my conversation with Chad Sanderson, Head of Data at Convoy, I ask him more questions about his recent post on the Semantic Data Warehouse. Here Chad shares a lot of forward thinking as well as concrete examples from his own experience.

The TLDR for me is:

  • Any business too complex to rely on well-understood metrics should spend more time semantically defining its real-world entities
  • Analysts need a better way to stand up for their needs when working with production, and data contracts can facilitate that
  • Data catalogs can go beyond data discovery to help analysts understand entity lifecycles

But, as always, I’m interested in what y’all think. So feel free to reach out. Here is the rest of my conversation with Chad.

Mark: You have talked a lot about creating a shared ground truth about the business for data users by establishing semantics in the warehouse. Data is obviously complex at companies that must manage large data sets. At a smaller company, though, is this process too constraining for analytics? Is the value from the Semantic Data Warehouse reserved for companies with a mature and complex business model?

Chad: Yes, good question. Having a complex business is one big reason to adopt the Semantic Warehouse. There are other reasons as well: for example, your financial data flows into Snowflake, or you have machine learning built on top of your warehouse. But when your tech is really intersecting with the real world, it is especially important to take this approach. If you are an Internet-only company with a relatively small user surface that people click on to buy things, you might be using fairly standard metrics to measure your business. But as soon as that user surface expands and your business goes beyond that webpage to be affected by logistical issues and “acts of God,” you will need to understand the lifecycle of entities via semantics in order to accomplish anything.

“If we don’t have amazing semantic modeling in the warehouse, not only is it very difficult to do analytics, it’s really difficult to just understand how our business works.”

Take Convoy as an example. We have quite literally hundreds of entities, including shippers, shipments, payouts, dedications, trucks, carriers, and facilities, all interacting with each other all the time. If we don’t have amazing semantic modeling in the warehouse, not only is it very difficult to do analytics, it’s really difficult to just understand how our business works. Back to the original theory for the warehouse: it needs to reflect the real world. That imperative becomes more understandable in a business like Convoy’s, or at Airbnb, Lyft, Flexport, and many others.

It is extraordinarily important to have a data contract in these cases because there are so many crazy things that could happen. I’ll give you one example from Convoy. Our payouts process is quite complicated. It starts when a shipper submits an RFP. We compete for that RFP, but we don’t know if we are awarded the shipment until months later, when they actually release that shipment into our marketplace. For payouts, we have to figure out whether we had bid for the shipment as it enters our marketplace. We also need to know if we signed a contract with this particular company and if there’s a contract price we should refer back to. Or, if we did bid for it, we need to determine whether we should use the price from the bid or instead bid a different price based on current market conditions. And for compliance reasons, we have to be able to audit the history of this payout and show how we came to the number. Having this defined and reliable is very valuable to the business. If our data quality is low and we cannot confirm the data coming from production, then we cannot answer this question accurately, and that can create big problems. This kind of process is happening all over the company.
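The payout decision Chad describes can be sketched as a small pricing function with an audit trail. This is an illustrative sketch only: the `Shipment` fields, the 10% re-bid threshold, and the audit format are assumptions for the example, not Convoy's actual logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Shipment:
    shipper_id: str
    contract_price: Optional[float]  # price from a signed contract, if one exists
    bid_price: Optional[float]       # our original RFP bid, if we bid on it
    market_price: float              # current marketplace estimate

def payout_price(s: Shipment, audit: list) -> float:
    """Pick a payout price and record how we arrived at it, for later audit."""
    if s.contract_price is not None:
        # A signed contract price takes precedence.
        audit.append(("contract", s.contract_price))
        return s.contract_price
    if s.bid_price is not None:
        # Hypothetical rule: re-bid at market if conditions shifted more than 10%.
        if abs(s.market_price - s.bid_price) / s.bid_price > 0.10:
            audit.append(("re-bid_at_market", s.market_price))
            return s.market_price
        audit.append(("bid", s.bid_price))
        return s.bid_price
    # No contract and no prior bid: fall back to current market conditions.
    audit.append(("market", s.market_price))
    return s.market_price

audit: list = []
price = payout_price(Shipment("acme", None, 1000.0, 1050.0), audit)
# market is within 10% of the bid, so the original bid price is used
```

The audit list is the key design point: every price the business pays can be traced back to the rule that produced it, which is exactly the compliance property Chad calls out.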

“If you want to build a pricing model that is going to generate $100 million a year for your company, you should not be allowed to do that off of a dump.”

But to your question about the Semantic Warehouse being too constraining, it does not preclude dumping. That still happens, it’s just not respected as being remotely sufficient for analyzing the business. If you want to build a pricing model that is going to generate $100 million a year for your company, you should not be allowed to do that off of a dump. And on the flip side, if you create data requirements that call for work from Production, those need to have a contract with engineering.

You can still do exploratory analytics and maybe discover some interesting things. But if the only use case for data in the warehouse was messing around and looking for some fun trends then there’s less of a reason to create the Semantic Warehouse. If you need to understand how to competitively run your business, this is the way. You can even do all of your financial reporting in the warehouse like we do at Convoy. When the data can be trusted for something this critical, the warehouse becomes like a production system. But to get data that is trustworthy and always on time you need to treat it as a product.

Mark: Treating data as a product is a popular concept. It has been a central part of the Data Mesh discussion, and now Snowflake is literally helping teams turn data into a product in its marketplace. But teams are still struggling to implement the idea. What have you seen that holds them back from productizing data?

Chad: A point that I frequently make is that a lot of analysts have Stockholm syndrome. They are convinced that being held captive to production is best for the company. Instead of pushing back and asking for the data they need, they let Product come up with the actual requirements and make do with what they get. But they inevitably accommodate themselves into a corner. They say “we’ll make it work, we’ll make it work,” and then the warehouse becomes such a mess that they can’t do their own job effectively and the blame is pinned on them.

At a lot of companies, the general attitude is that data is “exhaust” generated by services that can be captured and used for basic analysis. This can work up to a point, but people outside of data do not really understand where and when that process needs to adjust for complexity. Ad hoc questions from the business become more complex on the data side as they become more important to the business. The warehouse can do more to handle this; it can drive critical financial and product decisions. But to do that, the warehouse needs to be respected more like production.

To bring more resources to data, analysts need a better way to ask engineers for the things they need. This is where the contract comes in. The best approach is to represent it as a schema that is tied to the semantic meaning of what you are measuring. You need a way to show what entity it applies to, what data is being collected, what each column means, and what the attributes within each column mean. Even if the engineer isn’t going to become an expert on this, the data team needs to show they have a plan that will benefit the company if they get what they want.

Data contracts are best expressed as schemas to make sure analysts get the data they need. Image courtesy of Chad Sanderson
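One way to picture a contract like the one above is as a schema that carries the semantic meaning alongside the shape of the data, plus a check that flags violations as "bugs." The entity, column names, and contract format here are invented for illustration; they are not a real Convoy or Stemma contract.

```python
# Hypothetical data contract: the entity it applies to, who owns and consumes
# the data, and what each column and its values mean.
contract = {
    "entity": "shipment",
    "owner_service": "shipment-service",  # the producing service (illustrative)
    "consumer": "analytics",
    "columns": {
        "shipment_id": {"type": "string", "meaning": "Unique id of the shipment"},
        "status": {
            "type": "string",
            "meaning": "Lifecycle state of the shipment",
            "allowed": ["tendered", "booked", "in_transit", "delivered", "canceled"],
        },
        "canceled_by": {
            "type": "string",
            "meaning": "Party at fault for a cancellation, if any",
            "allowed": ["shipper", "convoy", None],
        },
    },
}

def validate(row: dict, contract: dict) -> list:
    """Return contract violations for one row; a non-empty list is a 'bug'."""
    violations = []
    for name, spec in contract["columns"].items():
        if name not in row:
            violations.append(f"missing column: {name}")
        elif "allowed" in spec and row[name] not in spec["allowed"]:
            violations.append(f"bad value for {name}: {row[name]!r}")
    return violations
```

Because the `meaning` fields travel with the schema, the same artifact answers both the engineer's question (what to emit) and the analyst's question (what it means), which is the dual role Chad assigns to contracts. Note how `canceled_by` also encodes the fault-attribution example he gives later: if that column is under contract, the billing question is answerable.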

Mark: Does there need to be a dedicated role to shepherd this process? Will we need a Data Product Manager who is writing data user stories and data requirements documents, or contracts, for Engineering?

Chad: I think the role of Data Product Manager is going to be key to making contracts work. We want engineers accountable to the data contract the same way they are accountable for features. The Product Manager would write the contract the way they would write a product requirements document. If the data breaks and doesn’t meet the contract, then it is a “bug” and it’s the engineer’s responsibility to fix. I think this will be common practice in five to ten years and accepted by product engineers. It has to start from the CTO to take off in an organization though. They need to acknowledge that data coming out of production can be valuable to the business and that dumping it over the fence can lead the exec team to make the wrong financial decisions.

Most businesses can find questions with inherent value that will motivate people to adopt a contract. For example, at Convoy, when a shipment gets canceled, it can either be attributed as the fault of the shipper or the fault of Convoy. If we are not collecting the appropriate data, we cannot determine who is at fault and then cannot modify our billing so that we don’t have to eat that charge. This is something that happens routinely, but some companies just eat the charge regardless. That is something where you can immediately tie revenue or cost savings to the effort and show how creating a contract, similar to a PRD, will help implementation.

We do not have a dedicated Product Manager for data at Convoy, but analysts and business intelligence engineers are serving that function. So when they’re trying to create a dataset, instead of writing a lot of crazy SQL, they are trained to connect with the engineer who owns the service that generates the data. They’ll say, “Here’s the schema that I actually need, here’s how I’m going to use it, and here’s how this is going to benefit the company.” It’s just like you would submit any other PRD. If it’s a very small change, like you’re just asking for a single property, the engineer can knock it out. If it’s a much larger change, like needing a whole new database or creating twenty events, then engineering will prioritize that like they would a feature request.

Mark: With context and discovery being so central to your idea of the Semantic Data Warehouse, where do data catalogs fit in and how do analysts self-serve an understanding of the models?

Chad: The data catalog is still very important. I frequently talk about my experience onboarding with Amundsen and Stemma and how incredibly valuable that process is to a data team. I do think it will need to evolve for the Semantic Data Warehouse. When an analyst does data discovery it’s a highly multi-dimensional workflow. They do not just want to see all the tables within the warehouse, though that’s the starting point, which Stemma, like Amundsen, makes very easy. In the next step, the analyst will wonder what columns mean, who created the table, what is its history, and even trace it back further to the service to understand how it was generated. But what they need to understand from the beginning of this process is the entities that are core to the business.

Let’s say I care about tenders, for example. I don’t just care about a tenders table, I actually care about how the tender lifecycle works. Once I understand how tenders fit into operations and what other entities it interacts with, then I can get data from a particular part of that lifecycle and drill down to the service it is generated from. Once I have my trustworthy data, I can build a dashboard or whatever else my goal might be. But without that first part, the rest of the process is completely offline. To understand the lifecycle, the analyst asks questions in Slack or in meetings and it can take weeks for them to get up to speed.

With semantics, the data catalog will become much more than an index of the data available in the warehouse. You’ll be able to map semantic meaning to entities the business cares about and break that into lifecycles. As an analyst, I want to understand the lifecycle so I can figure out where in that lifecycle I need to get my data from. From there, I move to the next abstraction down, which is services. All of these questions are answered by the contract, which tells you what data is coming from which service and what the service is doing. It shows you the software engineers who actually own the service, so you can reach out to them directly if you have questions. It shows the entities and events in that data, so you can drill down to the core tables and traverse the lineage tree to understand where the data is going and find valuable transformations you can put into your model. This is the way people think about exploring: it always starts from semantics and then moves on to what you are trying to find, whether it’s a source or a metric.
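The drill-down Chad describes, entity to lifecycle stage to owning service to table, could be modeled in a catalog as a simple mapping. Everything here is a made-up sketch: the "tender" lifecycle stages, service names, and table names are assumptions for illustration, not Stemma's or Convoy's actual model.

```python
# Hypothetical catalog entry: an entity, its lifecycle, and where the data for
# each stage comes from (owning service) and lands (warehouse table).
catalog = {
    "tender": {
        "lifecycle": ["created", "accepted", "scheduled", "completed"],
        "stages": {
            "created":   {"service": "tender-service",   "table": "tenders_raw"},
            "accepted":  {"service": "matching-service", "table": "tender_acceptances"},
            "scheduled": {"service": "schedule-service", "table": "tender_schedules"},
            "completed": {"service": "shipment-service", "table": "tender_completions"},
        },
    },
}

def tables_for(entity: str, stage: str) -> dict:
    """Drill down from an entity's lifecycle stage to its owning service and table."""
    return catalog[entity]["stages"][stage]

# An analyst who cares about accepted tenders starts from the lifecycle,
# not from a table search:
accepted = tables_for("tender", "accepted")
```

The point of the structure is the entry path: the analyst browses the lifecycle first and only then lands on a table, instead of the weeks of Slack archaeology the current workflow requires.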

The data catalog is in an ideal position to handle contract management. Image courtesy of Chad Sanderson

I also see the catalog in the Semantic Data Warehouse becoming a layer through which you can push contracts. Those contracts carry the metadata for the requirements and the data associated with the services themselves. And so the catalog provides a layer of abstraction where you can browse the semantics or the schema-level data, and then the warehouse-level data, and move between those easily. This aids the three workflows for analysts: finding the data they need, managing the contract with Production, and leveraging datasets downstream in the logical layer to produce business intelligence metrics or features for machine learning.

In part one, Chad and I discuss the benefit of defining semantics in the warehouse and the impact of the MDS on data modeling.
