This post continues our look at what it means to be proactive and reactive with respect to your data pipelines and metadata. I give advice on how to manage their trade-offs and get the best of both approaches.
I ended Part 1 saying that moving dataset and dashboard descriptions closer to the source helps keep them more up to date. As a reminder, here’s the high-level diagram of what this looks like:
However, if you’d like to enable less technical people at your organization to contribute their knowledge back into the catalog, this setup often excludes them. The result can be a ghost town: business users who have useful context about a dataset realize that they and their peers cannot improve the catalog, and lose the incentive to return to a tool that won’t adapt to their needs. Introducing a metadata layer that’s editable from a UI helps solve this problem:
This can be accomplished by a tool like Stemma, or even a headless CMS like Netlify CMS, which lets less technical users commit their changes to Git when editing descriptions. However, a setup like this has a greater chance of getting out of sync with the underlying data, because now the source of the data and the source of its description live in two separate places. You need strong (automated) processes or a culture of keeping descriptions updated for this to work.
Moreover, in real life things are rarely this simple. You might have Snowflake as your data warehouse, perform transformations using dbt, and maybe even want to plug data into your catalog from an external system through an API.
When your metadata comes from several different places, the descriptions can occasionally contradict each other.
To resolve this, you might approach the problem in a few different ways.
- You can display everything. Stemma supports this, but we find it’s usually not the way to go for our customers. Although it is the simplest method to implement, it clutters the UI and makes it harder to treat the data catalog as a source of truth. Imagine, for example, if the description coming from dbt contradicts the one added from the UI.
- Establish a resolution strategy. This can be a static priority list (which is what we use by default at Stemma), where you rank your metadata sources according to their trustworthiness. If a high-priority source has metadata, you only display that; when it doesn’t, you fall back to lower-priority sources.
- You can try more complex resolution strategies, such as one based on freshness or user roles. However, the fancier your strategy is, the harder it is to understand, so we recommend sticking with a clear priority list.
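A static priority list can be sketched in a few lines. The source names, the dictionary shape, and the function below are illustrative assumptions, not Stemma's actual implementation:

```python
# A minimal sketch of priority-based metadata resolution.
# Source names and the input shape are assumptions for illustration.
from typing import Optional

# Rank sources from most to least trusted.
SOURCE_PRIORITY = ["ui_edits", "dbt", "warehouse_comments"]

def resolve_description(descriptions: dict) -> Optional[str]:
    """Return the description from the highest-priority source that has one."""
    for source in SOURCE_PRIORITY:
        desc = descriptions.get(source)
        if desc:  # empty/missing entries fall through to lower-priority sources
            return desc
    return None

# No UI edit exists for this table, so the dbt description wins:
print(resolve_description({"dbt": "Orders fact table", "warehouse_comments": "orders"}))
# Orders fact table
```

The key property is that a lower-priority source is only consulted when every source above it is silent, so readers always know why a particular description was shown.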
The reactive side of the coin is handling descriptions that have already gone awry. This can be done through change management (more on that in the next section) and by surfacing missing descriptions.
Having an overview of how many descriptions are missing (per database or even more granularly) helps you fill in the gaps when the proactive approach fails. You can use this overview to understand how healthy your catalog is. Also, if you want to create a good experience for all users, it can help you find remaining datasets that need to be documented.
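Such an overview boils down to counting documented tables per database. The record shape below is a hypothetical example, not the schema of any particular catalog:

```python
# A minimal sketch of a coverage report: the fraction of tables with a
# non-empty description, grouped per database. The record shape is an
# assumption for illustration.
from collections import defaultdict

def coverage_by_database(tables: list) -> dict:
    """Map each database to its documented-table ratio."""
    totals = defaultdict(int)
    documented = defaultdict(int)
    for t in tables:
        totals[t["database"]] += 1
        if t.get("description"):  # empty string counts as undocumented
            documented[t["database"]] += 1
    return {db: documented[db] / totals[db] for db in totals}

tables = [
    {"database": "analytics", "table": "orders", "description": "Orders fact table"},
    {"database": "analytics", "table": "tmp_orders", "description": ""},
]
print(coverage_by_database(tables))  # {'analytics': 0.5}
```

The same grouping could be done per schema or per owner to make the gaps more actionable.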
In any reasonably sized data infrastructure, changes can take many forms and can be painful to detect and debug. This is especially true if multiple layers of your data architecture have information about your datasets’ structure. This is when having reliable lineage becomes important.
You can manage changes in a proactive way, by notifying downstream users ahead of time that a table’s structure or meaning is about to change. Stemma allows you to view a table’s lineage, and notify the owners and frequent users of those tables that a change is happening.
Lineage also helps a lot on the reactive end of the spectrum. When a bug is introduced, you can trace it back through the upstream tables, and the dependency graph tells you where to look.
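Both directions amount to walking a dependency graph. The table names and edge list below are hypothetical; real lineage would come from query logs or a catalog API:

```python
# A minimal sketch of lineage traversal over an assumed edge list that
# maps each table to the tables reading from it (its direct downstreams).
from collections import deque

downstream = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_revenue", "mart.customer_ltv"],
}

def all_downstream(table: str) -> set:
    """Every table affected by a change to `table` (breadth-first walk)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Proactive use: these are the tables whose owners should be notified
# before raw.orders changes.
print(sorted(all_downstream("raw.orders")))
# ['mart.customer_ltv', 'mart.daily_revenue', 'staging.orders']
```

For the reactive case, inverting the edge list and running the same walk gives you the upstream tables to inspect when a bug appears.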
There are other proactive ways a tool like Stemma can help you manage changes:
- Use a shared glossary for common terms.
For instance, if the range of values or the meaning of ‘user_id’ changes, you don’t have to manually update every table that has this column.
- Link table and column descriptions.
This feature works similarly to the glossary, but without creating an explicit definition. This way you can have a ‘source of truth’ table, and all tables deriving from that can share its descriptions.
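Both features can be thought of as a fallback chain: a column's local description wins, then a linked source-of-truth table, then the shared glossary. The table names, glossary entries, and function shape below are illustrative assumptions, not Stemma's API:

```python
# A minimal sketch of DRY description resolution. All names here are
# hypothetical examples.
from typing import Optional

GLOSSARY = {"user_id": "Stable identifier for a registered user."}

SOURCE_OF_TRUTH = {
    "analytics.users": {"user_id": "Primary key of the users table."},
}

def column_description(local: dict, linked_table: Optional[str], column: str) -> Optional[str]:
    """Resolve a column description without duplicating definitions."""
    if local.get(column):          # 1. an explicit local description wins
        return local[column]
    if linked_table:               # 2. otherwise inherit from the linked table
        inherited = SOURCE_OF_TRUTH.get(linked_table, {}).get(column)
        if inherited:
            return inherited
    return GLOSSARY.get(column)    # 3. finally, fall back to the glossary

# A derived table with no local docs inherits from its source of truth:
print(column_description({}, "analytics.users", "user_id"))
# Primary key of the users table.
```

Because each definition lives in exactly one place, updating the glossary entry or the source-of-truth table updates every derived description at once.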
Proactive change management is important for metadata, not just the data itself. Handling it in a DRY (don’t repeat yourself) fashion is one of the best ways to minimize human error and make sure your descriptions aren’t out of date.
And finally, much of this depends on your organization’s structure – do data scientists write ETL jobs? Do you use dbt? Are only data engineers responsible for ops, or is everyone responsible for the data they produce? You can only set your data catalog and discovery up for success if you tailor them to how your company works.
- The reactive way of handling things has its place, and that is after the proactive approach fails. You should do everything you can to widen the scope of what you handle proactively, but have reactive processes in place to resolve incidents.
- Keep all the metadata you can, but be conscious of the potential for conflicting sources and have a method for resolving them.
- Track lineage for both proactive and reactive approaches to change management. The more clearly you can communicate with downstream users, the better you can spare them from nasty surprises. Lineage can also help automate documentation or reactively debug issues that inevitably occur.