This blog post highlights a way of thinking for data professionals and anyone else who works with data. If you have a data catalog in your company, or are planning to introduce one, this post raises some questions you should consider.
Generally, being proactive is considered a good thing: you’re conscious of the future and deal with things before anything bad materializes. Being reactive is the opposite; it brings to mind problems that need dealing with, or incidents to resolve.
So what does it mean to be proactive when talking about data? You anticipate changes to the metadata, freshness, structure, and – this is the sneakiest – the meaning of your datasets. Being proactive about metadata means you don’t let descriptions get out of sync with the underlying data asset. If you have freshness tests for your ETL jobs, you’re proactive about data freshness, since you don’t let data arrive late and hold up all the downstream jobs. Structure changes can be handled similarly. A particularly hard nut to crack is being proactive about changes in meaning. Unless you have prank-loving data engineers who swap columns around while keeping their names, changes in meaning are usually very subtle. They materialize in an update to what an ‘active user’ means, for instance. That alters the meaning of monthly active users, and can have far-reaching implications if you aren’t careful.
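To make the freshness point concrete, here is a minimal sketch of such a check. All the names are illustrative assumptions, not anything from a particular stack: a hypothetical `events` table with a `loaded_at` timestamp column, a six-hour threshold, and a DB-API style warehouse connection.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: how stale the newest row may be before we fail.
FRESHNESS_THRESHOLD = timedelta(hours=6)

def check_freshness(conn, table: str = "events", column: str = "loaded_at") -> None:
    """Raise if the newest row in `table` is older than the threshold."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT MAX({column}) FROM {table}")
        (latest,) = cur.fetchone()

    if latest is None:
        raise RuntimeError(f"{table} is empty, nothing to measure")

    # Assumes the warehouse returns a timezone-aware timestamp.
    age = datetime.now(timezone.utc) - latest
    if age > FRESHNESS_THRESHOLD:
        raise RuntimeError(
            f"{table} is stale: newest {column} is {age} old "
            f"(threshold: {FRESHNESS_THRESHOLD})"
        )
```

Run something like this as a gate before downstream jobs start, so lateness surfaces as a failed check instead of a quietly stale dashboard.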
Being reactive about data usually means reacting to the above scenarios only after they happen – a buggy ETL job is merged, descriptions aren’t updated, and so on. So far this all seems like a setup to promote a proactive approach to everything, right? Well, yes, with a caveat. Expanding the proactive measures you take is important, but you can’t neglect the reactive side of the coin, because mistakes always happen. So it’s important to have good reactive processes in place; otherwise you won’t be able to manage incidents in a timely, reliable manner.
Metadata
Let’s finally get to data catalogs. There’s been a shift over the past few years from people-driven to automated, and with that, from reactive to proactive. Older data catalog products often required a dedicated data steward within the organization, since data entry was mostly manual and error-prone. As organizations shifted to make each data producer responsible for the metadata of their own datasets, the newer catalogs have become much more automated. They ingest as much metadata as possible and connect it across different systems. Stemma, for instance, can resolve the tables that your dashboards depend on and link the two together, even though one comes from a dashboarding product and the other from the data warehouse.
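Here is a toy illustration of that kind of cross-system linking – emphatically not Stemma’s actual mechanism, which would use a real SQL parser rather than a regex. The idea is just to show the shape of the output: pull the SQL behind each dashboard from the BI tool, extract the warehouse tables it reads, and emit dashboard-to-table lineage edges.

```python
import re

# Naive table-reference extraction; real lineage tools parse the SQL properly.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def link_dashboards_to_tables(dashboards: dict[str, str]) -> list[tuple[str, str]]:
    """`dashboards` maps dashboard name -> the SQL query behind it."""
    edges = []
    for name, sql in dashboards.items():
        for table in TABLE_REF.findall(sql):
            edges.append((name, table))
    return edges

# Example: one dashboard reading two warehouse tables.
edges = link_dashboards_to_tables(
    {"revenue_daily": "SELECT d.day, SUM(o.total) FROM orders o "
                      "JOIN dates d ON o.day = d.day GROUP BY d.day"}
)
print(edges)  # [('revenue_daily', 'orders'), ('revenue_daily', 'dates')]
```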
This automation points in the proactive direction, since it eliminates manual (and duplicated) effort, and with it reduces the chance of things getting out of sync. There’s always a human element, however, since those descriptions don’t write themselves. So how do you ensure that dataset descriptions are kept in sync?
The simplest, and usually best, solution is to store descriptions in your source system and publish them to your data catalog. This is sometimes called shift-left. On a practical level, it means using the comment field in your database or data warehouse. If you have a Protobuf or JSON schema for your events, you can take the field descriptions from there.
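As a sketch of what shift-left can look like in practice, the snippet below reads field descriptions from a JSON Schema and pushes them into warehouse column comments, where an automated catalog can pick them up. The function name, schema path, and connection are assumptions for illustration; the comment syntax shown is the common `COMMENT ON COLUMN` form, which varies slightly between warehouses.

```python
import json

def publish_descriptions(conn, schema_path: str, table: str) -> None:
    """Push field descriptions from a JSON Schema into column comments."""
    with open(schema_path) as f:
        schema = json.load(f)

    with conn.cursor() as cur:
        for column, spec in schema.get("properties", {}).items():
            description = spec.get("description")
            if not description:
                continue  # nothing to publish for this field
            # COMMENT is a utility statement in most warehouses and usually
            # cannot take bind parameters, so escape single quotes manually.
            escaped = description.replace("'", "''")
            cur.execute(f"COMMENT ON COLUMN {table}.{column} IS '{escaped}'")
    conn.commit()
```

With something like this in place, editing a description means editing the schema; the warehouse comment, and with it the catalog entry, follows automatically instead of drifting.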

However, this presents a problem – it excludes non-technical users, who might have a better top-down view of the data and might use it more actively. This can be okay depending on your organization: if you have the tooling, and all data consumers are able and willing to contribute back to the source descriptions, it works. But the majority of real-world cases are more complicated than this. How do you deal with that?
In Part 2 I will answer this question as well as talk about how to handle changes to datasets and, importantly, take a closer look at reactive approaches for finding and fixing errors.