For far too long, data catalogs have focused on data users while neglecting the needs of software engineers and, specifically, data engineers. The core features in all data catalogs - metadata capture, tagging, lineage, to name a few - are skewed toward a UI-based search-and-discovery paradigm. Fundamentally, these capabilities support data users but offer relatively little value to data creators, which has led to two main problems with the data catalog:
- The data catalog has become purely reactive software
- The value-add for the data catalog scales linearly with the number of users
It’s time for the data catalog to evolve. Catalogs already have access to rich, cross-sectional views of your data ecosystem. The next frontier is to repurpose this information for operational use in ways that enable data creators to create and maintain data more efficiently. The most valuable use case is to integrate the data catalog into the CI/CD pipeline.
Why integrate the data catalog with the CI/CD pipeline?
Your data catalog should have a rich set of metadata available to ensure continuity within your data pipelines. For example, the catalog should alert you to scenarios like the following before the changes reach production:
- Use lineage to understand whether removing a field will impact a business-critical or externally reported dashboard
- Combine lineage with data quality test results to determine if new “not null” constraints in your Avro files will lead to records not being inserted into your database
- Check query logs and dashboard view counts to confirm that a table has not been used in the previous three months before dropping it
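The first and third checks above can be sketched as a small CI gate. This is a minimal, self-contained illustration: the lineage and view-count dictionaries stand in for data a real integration would fetch from your catalog's API, and all table, column, and dashboard names are hypothetical.

```python
# Hypothetical lineage metadata: "table.column" -> downstream dashboards.
# A real check would pull this from the data catalog, not a literal dict.
LINEAGE = {
    "orders.customer_id": ["revenue_dashboard", "churn_report"],
    "orders.internal_note": [],
}

# Hypothetical usage stats: dashboard -> view count over the last 90 days.
DASHBOARD_VIEWS = {
    "revenue_dashboard": 1423,
    "churn_report": 0,
}

def check_field_removal(field: str) -> list[str]:
    """Return warnings if removing `field` would break an actively used dashboard."""
    warnings = []
    for dashboard in LINEAGE.get(field, []):
        views = DASHBOARD_VIEWS.get(dashboard, 0)
        if views > 0:  # only dashboards someone actually looks at matter
            warnings.append(
                f"Removing {field} breaks '{dashboard}' ({views} views in 90 days)"
            )
    return warnings
```

In a CI job, a non-empty warning list would fail the build (for example, by exiting with a non-zero status), blocking the schema change before it reaches production.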
The CI/CD pipeline is the control point that represents the moment when the shape and semantics of your data change. This is the right place to inject your data catalog because it sits before the actual data changes. The diagram below provides a simplified example of a CI/CD process that validates metadata changes with the data catalog before deployment.
Since the data catalog sits in front of the deployment, it can now be proactive in its decision making. This has tremendous benefits for features such as data lineage, where the standard, reactive approach is to build lineage from SQL logs, which can only be read after changes to your data have occurred.
This approach also has the side effect of keeping your data catalog up to date.
Why is this valuable?
When utilizing the data catalog for operational use cases, the value your catalog provides scales with the amount of data you have. Traditionally, you could measure the ROI with a formula akin to:
(Number of data users) * (avg time saved per user) * (avg salary)
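To make the baseline concrete, here is a quick worked example. Every number is purely illustrative - an assumed user count, assumed hours saved, and an assumed hourly rate, not benchmarks from any real organization:

```python
# Illustrative inputs to the baseline ROI formula (all values assumed).
num_data_users = 50                    # analysts and data scientists using the catalog
avg_hours_saved_per_user_per_year = 40 # time saved on search and discovery
avg_hourly_salary = 75                 # USD

# (Number of data users) * (avg time saved per user) * (avg salary)
annual_roi = num_data_users * avg_hours_saved_per_user_per_year * avg_hourly_salary
print(annual_roi)  # 150000
```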
While this is a good baseline, it is one-sided: it does not account for the time data creators spend generating new data assets, and specifically the significant amount of time it takes to track down and fix data dependency issues after they occur. Consider the tasks required when a breaking change reaches production:
- Finding the root cause
- Fixing the root cause
- Planning and executing a manual data cleanup & backfill
The time cost here is tremendous; not to mention that when a data quality issue makes it into production, there can be a tangible impact on your users. Companies with larger amounts of data can derive outsized benefits from operationalizing their data catalog.
Scaling with Data
In order for catalogs to scale with data, they must cater to the data creators: software engineers. An engineer’s interaction with the data catalog will differ from an analyst’s or a data scientist’s. In particular, engineers will require:
- A robust set of programmatic interfaces that fits into their development lifecycle
- Integration with (or deployment of) the data catalog into dev, staging and prod environments
An API that simply tells engineers lineage exists will work for some organizations, but those with larger data sets, stringent change controls, more complex pipelines, and larger teams will not find it sufficient. To win over these companies, the catalog will need to succinctly summarize all of the knowledge it has about related data assets within the APIs it makes available to engineers:
- Is the data being used for operational purposes or analytics?
- Is the specific column being removed used in a business critical report?
- When was the data last used?
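As a sketch, a change-impact summary answering the three questions above might look like the following. Every field name and value here is hypothetical - an assumption about what such an API could return, not a real catalog product's schema:

```python
import json

# Hypothetical change-impact summary for a proposed schema change,
# answering: usage type, business-critical dependents, and last use.
impact_report = {
    "asset": "warehouse.orders.customer_id",
    "usage_type": "analytics",  # vs. "operational"
    "business_critical_reports": ["revenue_dashboard"],
    "last_used": "2021-06-01",
}

def is_safe_to_remove(report: dict) -> bool:
    """A change is safe only if no business-critical report depends on the asset."""
    return not report["business_critical_reports"]

# An engineer's tooling could render this summary in a pull-request comment.
print(json.dumps(impact_report, indent=2))
```

A CI integration could call a function like `is_safe_to_remove` (hypothetical) per changed asset and block the merge when it returns `False`.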
Integrating your data catalog with your CI/CD pipeline lets you proactively leverage your metadata to identify breaking changes in your data pipelines. It also gives your team the ability to scale with your data while supporting the increasing demands of a growing data set.
At Stemma, we’re bringing all of this together and reimagining what the data catalog can do to answer the tried and true question:
Will changing this data asset have an impact on my business?