June 21, 2022 · 4 min read

Balancing Proactive and Reactive Approaches to Data Management - Part 1

by Balint Haller, Software Engineer


This blog post highlights a way of thinking for data professionals or anyone dealing with data. If you have a data catalog in your company or are planning to introduce one, this post raises some questions you should consider.

Generally, being proactive is considered a good thing: you're conscious of the future, and you deal with things before anything bad can materialize. Being reactive is the opposite; it brings to mind problems that need dealing with, or incidents to resolve.

So what does it mean to be proactive when talking about data? You anticipate changes to the metadata, freshness, structure, and – this is the sneakiest – the meaning of your datasets. Being proactive about metadata means you don't let descriptions get out of sync with the underlying data asset. If you have tests for your ETL jobs, you're proactive about data freshness, since you don't let data become late and hold up all the downstream jobs. Structure changes can be handled similarly. A particularly hard nut to crack is being proactive about changes in meaning. Unless you have some prank-loving data engineers who swap columns around while keeping their names, changes in meaning are usually very subtle. They materialize in an update to what an 'active user' means, for instance. That alters the meaning of monthly active users, and can have far-reaching implications if you aren't careful.
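To make the freshness side concrete, here is a minimal sketch of a proactive freshness check. The connection string, the `events` table, the `updated_at` column, and the six-hour SLA are all assumptions for the example; the point is simply to fail loudly before downstream jobs consume stale data.

```python
from datetime import datetime, timedelta, timezone

import sqlalchemy  # pip install sqlalchemy

# Hypothetical warehouse connection; swap in your own.
ENGINE = sqlalchemy.create_engine("postgresql://user:pass@warehouse/db")
FRESHNESS_SLA = timedelta(hours=6)  # assumed SLA for this example


def assert_fresh(table: str, ts_column: str = "updated_at") -> None:
    """Raise if the newest row in `table` is older than the freshness SLA."""
    with ENGINE.connect() as conn:
        newest = conn.execute(
            sqlalchemy.text(f"SELECT MAX({ts_column}) FROM {table}")
        ).scalar()
    # Assumes the timestamp column is stored timezone-aware in UTC.
    if newest is None or datetime.now(timezone.utc) - newest > FRESHNESS_SLA:
        raise RuntimeError(f"{table} is stale: last update was {newest}")


# Run after each ETL job, or on a schedule, so lateness is caught early.
assert_fresh("events")
```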

Being reactive about data usually means reacting to the above scenarios only after they happen – a buggy ETL job is merged, descriptions aren't updated, and so on. So far this all seems like a setup to promote a proactive approach to everything, right? Well, yes, with a caveat. Expanding the proactive measures you take is important, but you can't neglect the reactive side of the coin, because mistakes always happen. So it's important to have good reactive processes in place; otherwise you won't be able to manage incidents in a timely, reliable manner.

Metadata

Let's finally get to data catalogs. There's been a shift over the past few years from people-oriented to automated, and with that, from reactive to proactive. Older data catalog products often required a dedicated data steward within the organization, since data entry was mostly manual and error-prone. As organizations shifted to make each data producer responsible for the metadata of their own datasets, newer catalogs have become far more automated. They gulp up as much metadata as possible and connect it across different systems. Stemma, for instance, is able to resolve the tables that your dashboards depend on and link the two together, even though one comes from a dashboarding product and the other from the data warehouse.
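To illustrate the linking idea (this is not Stemma's implementation, just a naive sketch): given the SQL behind a dashboard, you can extract the tables it reads from and record a lineage edge between the dashboard and each table. The regex below only handles plain FROM/JOIN clauses; real catalogs use a proper SQL parser.

```python
import re

# Naive pattern: captures identifiers after FROM or JOIN.
# Ignores CTEs, subqueries, and quoted identifiers.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def tables_referenced(dashboard_sql: str) -> set[str]:
    """Return the warehouse tables a dashboard query reads from."""
    return set(TABLE_PATTERN.findall(dashboard_sql))


# Hypothetical dashboard query: emit one lineage edge per table it touches.
sql = """
SELECT u.country, COUNT(*) AS mau
FROM analytics.monthly_active_users u
JOIN analytics.accounts a ON a.user_id = u.user_id
GROUP BY u.country
"""
for table in tables_referenced(sql):
    print(f"revenue_dashboard -> {table}")
```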

This automation points in the proactive direction, since it eliminates manual (and duplicated) effort, and with it reduces the chance of things getting out of sync. There's always a human element, however, since those descriptions don't write themselves. So how do you ensure that dataset descriptions are kept in sync?

The simplest, and usually best, solution is to store descriptions in your source system and simply publish them into your data catalog. This is sometimes called shift-left. On a practical level, it means using the comment field in your database or data warehouse. If you have a Protobuf or JSON schema for your events, you can take the field descriptions from there.
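As a sketch of what shift-left looks like in practice, assuming a Postgres-compatible warehouse: column comments can be read straight out of the system catalogs and pushed into the data catalog. The publishing step below is a hypothetical placeholder for however your catalog ingests metadata.

```python
import sqlalchemy  # pip install sqlalchemy

ENGINE = sqlalchemy.create_engine("postgresql://user:pass@warehouse/db")

# Standard Postgres recipe: join information_schema.columns to
# pg_description to recover per-column comments.
COMMENT_QUERY = sqlalchemy.text("""
    SELECT c.table_name, c.column_name, d.description
    FROM information_schema.columns c
    JOIN pg_catalog.pg_class t ON t.relname = c.table_name
    JOIN pg_catalog.pg_namespace n
      ON n.oid = t.relnamespace AND n.nspname = c.table_schema
    JOIN pg_catalog.pg_description d
      ON d.objoid = t.oid AND d.objsubid = c.ordinal_position
    WHERE c.table_schema = :schema
""")


def publish_descriptions(schema: str) -> None:
    """Read column comments from the warehouse and publish each one."""
    with ENGINE.connect() as conn:
        for table, column, description in conn.execute(
            COMMENT_QUERY, {"schema": schema}
        ):
            # Stand-in for your catalog's ingestion call.
            print(f"{schema}.{table}.{column}: {description}")


publish_descriptions("analytics")
```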

However, this presents a problem: it excludes non-technical users, who might have a better top-down view of the data and might use it more actively. Whether that's acceptable depends on your organization – if you have the tooling, and all data consumers are able and willing to contribute back to the source descriptions, it works fine. But the majority of real-world cases are more complicated than this. How do you deal with that?

In Part 2 I will answer this question as well as talk about how to handle changes to datasets and, importantly, take a closer look at reactive approaches for finding and fixing errors.

