One of the early questions that data engineering teams pose when implementing a catalog is: should we make the catalog responsible for gathering metadata from data systems ("pull"), or task data systems with reporting metadata to the catalog ("push")? And, what are the consequences of using one approach over the other?
We believe there's far less choice in this matter than is commonly assumed. Every enterprise data catalog uses a mix of push and pull data sources, and it's rarely an open decision which to use for a particular source.
Data catalogs combine metadata from many different sources. How do you get all of that information into your catalog? And does the method matter?
To focus on a specific example: with Snowflake, you can either query information_schema every few hours ("pull"), or have DDL statements sent to you as they execute (e.g. ALTER TABLE foo ADD COLUMN bar) ("push").
This is a useful example because both push and pull are available. That allows us to examine the consequences of each method. But it's also interesting in that this is one of the only data sources where optionality is even possible: the vast majority of sources support only pull.
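The pull side of that example can be sketched in a few lines. This is a minimal sketch assuming a DB-API-style cursor (such as the one provided by snowflake-connector-python); the query shape is illustrative, not a drop-in connector.

```python
# Sketch of a "pull" extractor: periodically run a query against
# information_schema and turn the result into a snapshot the catalog
# can ingest. The cursor is any DB-API-style cursor (an assumption).

SNAPSHOT_QUERY = """
    SELECT table_schema, table_name, column_name, data_type
    FROM information_schema.columns
    ORDER BY table_schema, table_name, ordinal_position
"""

def pull_snapshot(cursor):
    """Return column-level metadata as {(schema, table): [(column, type), ...]}."""
    cursor.execute(SNAPSHOT_QUERY)
    snapshot = {}
    for schema, table, column, dtype in cursor.fetchall():
        snapshot.setdefault((schema, table), []).append((column, dtype))
    return snapshot
```

A scheduler (cron, Airflow, or the catalog's own) would call `pull_snapshot` every few hours and hand the result to the catalog's ingest step.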
There are three benefits commonly proposed as reasons to prefer push:
- Timeliness: your catalog is always up to date.
- Change awareness: often associated with push, but as you'll soon see, the connection isn't actually very robust.
- Organizational: better support for federated ownership of metadata assets.
For teams that are starting to democratize data within their organization, the biggest challenge is usually understanding what data is where and how it's used. However, for teams that have found success in distributing that information through the organization reliably, their attention often turns to understanding how their data systems and usage change over time.
Thus, understanding metadata evolution is critical for advanced data organizations. "Change awareness" means you can use the catalog to understand how a table has evolved over time. You can see what recent changes your teammates made to the Airflow job that failed and paged you. You can understand why the dashboard that was fine last week is producing wildly different results this week.
Push is often conflated with change awareness, but push vs. pull has little practical bearing on whether your catalog is change-aware. Let's go back to the Snowflake example. Snowflake experts might have noticed that there's actually a third option: periodically reading the query_history view and parsing DDL commands. That's nominally a "pull" method, since the catalog initiates the transfer, yet it uses the same information source as the "push" example.
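That third option can be sketched as a filter over query history. The row shape mirrors Snowflake's query_history view (query_text, start_time), but the regex is a deliberate simplification; a real implementation would use a proper SQL parser.

```python
import re

# Sketch of "delta extraction over pull": periodically read the query
# history and keep only structural (DDL) statements, in time order.
# The row shape and the regex are illustrative assumptions.
DDL_PATTERN = re.compile(r"^\s*(ALTER|CREATE|DROP)\s+TABLE\b", re.IGNORECASE)

def extract_ddl(query_history_rows):
    """Return only structural changes, sorted by when they ran."""
    ddl = [row for row in query_history_rows
           if DDL_PATTERN.match(row["query_text"])]
    return sorted(ddl, key=lambda row: row["start_time"])
```

The output is a change log, even though the transfer was catalog-initiated: the same delta-style information the "push" path would deliver.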
This reveals a less-obvious category for data sources that's notable for change awareness:
- Snapshot extraction: the catalog processes an entire snapshot of the source's state. It can diff the new snapshot against its stored snapshots and record the delta. It's limited in granularity: it cannot detect changes that happen between runs, which in practice buckets changes into 3-12 hour windows.
- Delta extraction: the source tool keeps a structural change-log (e.g. WAL or DDL log), and the catalog replays those to calculate state at any point in time. This is to metadata what change data capture is to data.
Most push-based sources are delta-based, while most pull sources are snapshot based, but that's not a hard rule: there are exceptions in each direction. What's critical is that of the types of data commonly found in a catalog, very few actually support delta extraction: pretty much just databases (due to WALs or SQL logs) and tools using git (e.g. Flyway or Airflow). Because most sources only support snapshots, the catalog must be quite good at inferring deltas from snapshots to be comprehensively change-aware.
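Inferring deltas from snapshots, as the last sentence describes, can be sketched as a simple diff. Here a snapshot is assumed to map table names to sets of column names; real catalogs track far more attributes per object.

```python
# Sketch of snapshot diffing: given two snapshots of the form
# {table_name: {column, ...}}, compute the structural delta between them.
# The snapshot shape is an illustrative assumption.

def diff_snapshots(old, new):
    """Return a list of (change_type, object) tuples describing the delta."""
    changes = []
    for table in new.keys() - old.keys():
        changes.append(("table_added", table))
    for table in old.keys() - new.keys():
        changes.append(("table_dropped", table))
    for table in old.keys() & new.keys():
        for col in new[table] - old[table]:
            changes.append(("column_added", f"{table}.{col}"))
        for col in old[table] - new[table]:
            changes.append(("column_dropped", f"{table}.{col}"))
    return changes
```

Note one inherent limitation: a rename shows up as a drop plus an add, and anything that changed and changed back between runs is invisible. That's the granularity ceiling of snapshot-based change awareness.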
How does this impact users? If you're a data analyst trying to understand if a table has changed recently, and you check the catalog, your user experience would likely be about the same. You can see that the table has changed a few times over the past few days. Whether or not your catalog was backed by snapshots, deltas, push or pull, you got the info you needed.
Regardless of snapshots or deltas, each has the same usability problem: they capture what happened, not why. In the above example, wouldn't you much rather see a git commit saying “Renamed the zipcode column to postal_code for i18n migration” (possibly with a link to the JIRA ticket with more context), rather than the ALTER TABLE SQL itself?
Most catalogs fail because they don't have enough metadata sources attached, or because the catalog isn't smart enough to infer connections between different data sources, not because individual sources weren't granular or timely enough.
Setup and Maintenance costs
Pull-based systems are easy to get going. For a cloud catalog, you just need to put in your credentials and make sure your networking permits the connection. OSS catalogs will either come with an inbuilt scheduler or will require you to bring your own ETL system (like Airflow). Generally, no changes to the underlying source systems are required.
Push systems require configuration changes to your underlying data systems: for example, you need to configure your database to emit change logs. Then you need to run an agent to sink that data into the catalog. You also need an initial pull of a metadata snapshot to seed the catalog.
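The agent step can be sketched as a small forwarding loop. The event shape and the sink (e.g. an HTTP POST to the catalog's ingest endpoint) are placeholder assumptions, not a real catalog API.

```python
# Sketch of a push agent: consume change events from some stream (e.g.
# a Kafka consumer) and forward each one to the catalog. Tracking the
# last processed offset lets a restarted agent resume without gaps.
# The event shape ({"offset": ..., ...}) and sink are assumptions.

def run_agent(events, sink):
    """Forward each change event to the catalog; return the last offset seen."""
    last_offset = None
    for event in events:
        sink(event)  # e.g. an HTTP POST to the catalog's ingest endpoint
        last_offset = event["offset"]
    return last_offset
```

In a real deployment the offset would be committed durably (e.g. back to the consumer group), which is exactly the kind of streaming plumbing discussed below.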
The biggest ongoing maintenance burden with any metadata system is ensuring that its connections to data systems are uninterrupted.
Push systems' Achilles' heel is generally infrastructural: push systems require high-quality, reusable streaming building blocks (think easy creation of Kafka clusters and topics, monitoring of consumers, and CI/CD for consumers). One pernicious challenge is that the streaming outlet of a data system can affect the system's own stability: e.g., if your Postgres write-ahead log listener goes down, Postgres itself may run out of disk.
Pull systems' weak point is typically organizational, in the sense that keeping credentials or networking rules up to date can be a challenge since the team making changes may not prioritize the catalog. This typically starts to become a problem at organizations with 10,000+ employees due to federated ownership and the high number of discrete systems deployed.
For most sources where steady state will be covered by push-based updates, you will need to pull snapshots to seed the catalog initially. However, you will also want to retain this capability on an ongoing basis, to allow you to recover from any downtimes where changes were missed. In a change based system, if you miss a single delta, your catalog may permanently be incorrect without intervention. This can't be stressed enough: for every push-based source, you always need to have pull fully configured and ready to go. This means that the maintenance burden of push-based systems is usually a superset of a pull-oriented system.
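The "pull as a backstop" idea can be sketched as a reconciliation check: periodically pull a full snapshot and compare it with the state the catalog reached by replaying deltas. Any disagreement means a delta was missed and the source needs re-seeding. The snapshot shape is the same illustrative assumption as before.

```python
# Sketch of reconciliation between a pulled snapshot and the state the
# catalog computed by replaying deltas. Both are assumed to map
# table_name -> set of columns; any mismatch flags drift to repair.

def find_drift(pulled, replayed):
    """Return the tables whose pulled schema disagrees with the replayed state."""
    all_tables = pulled.keys() | replayed.keys()
    return sorted(t for t in all_tables if pulled.get(t) != replayed.get(t))
```

A non-empty result would trigger an alert and a re-seed from the pulled snapshot, which is why pull must stay fully configured even for push-based sources.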
When to use what?
Well, let's start with the easy cases: situations where you are forced to use one method. You must periodically pull data out of any systems that don't expose streaming delta logs:
- BI tools
- ML feature stores
- Nearly all SaaS tools
If the data in question uses the catalog as the historical source of truth (meaning it's not stored anywhere else), it must be pushed. Sometimes CI/CD or data quality runs will fall into this category.
For the few sources where you can choose between methods (databases and warehouses, information stored in git such as database migrations or Airflow jobs, and many ETL run results), we suggest the following approach, applied per source:
- Start with pull. You need this to seed the catalog initially no matter what. But even if you do end up adding push later, you'll likely need to retain pull capability in case of outages that interrupt the pushed delta log.
- Wait and see if timeliness of metadata causes practical problems for users. If you survey and interview data catalog users, it's likely that it won't be a top problem they're facing.
- However, if timeliness does prove critical for your users and you can't shorten the polling interval enough, implement push. Because of the additional setup and maintenance costs, you should get assurances upfront that the benefits are strong enough to make this ROI-positive.
It's important to note that the only factor that matters here is timeliness, because typically change awareness won't be seriously affected by push vs pull. So it's worth digging into timeliness a bit further.
Metadata is consumed by both humans and computers, and timeliness is quite different between those audiences.
For humans, near real-time surfacing of operational information (like data quality results or ETL job runs) is extremely helpful when debugging data problems. However, things like table structure, dashboard contents, or ETL job configuration typically don't change often enough for a few hours' delay to impact your understanding and trust of the metadata.
If you start to use your catalog for operational use cases, e.g. reading a Sensitive tag to limit access to certain users within an organization, you may need to react with very low latency to changes. It would be unacceptable for a table to get an email column added with a PII tag, but the access control system lags behind by 12 hours, allowing unfettered access to the underlying data in the meantime. Similarly, stale data can undermine using your catalog in your CI/CD pipelines.
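One way to guard against that failure mode is to fail closed when catalog metadata is stale. This sketch assumes each tag record carries a last-synced timestamp; the one-hour threshold and function names are arbitrary illustrations, not a real access-control API.

```python
from datetime import timedelta

# Sketch of a freshness guard for operational use of catalog tags.
# If the metadata hasn't been synced recently, deny access rather than
# risk exposing a column whose PII tag hasn't propagated yet.
# The threshold and record shape are illustrative assumptions.
MAX_STALENESS = timedelta(hours=1)

def may_grant_access(column_tags, last_synced, now):
    """Fail closed: deny if tags are stale or the column is tagged PII."""
    if now - last_synced > MAX_STALENESS:
        return False  # stale metadata: assume the worst
    return "PII" not in column_tags
```

Failing closed trades availability for safety, which is usually the right call when the downside is unfettered access to sensitive data.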
For most catalogs, the path is fairly clear: start with pull, and possibly layer on push from the few data sources that support it and give benefits for your users. And: don't worry about it too much! There are probably plenty of other hurdles that will hold you up before the data transfer method.
At a very large scale (usually companies with 10k+ knowledge-worker employees), more of the challenges associated with deploying a centralized data catalog are associated with organizational complexities, not technical challenges. For example, consider a large organization where individual teams own their own infrastructure, and a centralized team owns the catalog. There may be dozens of such teams owning their own data infrastructure across separate business units. In cases like this, it makes sense to take on more technical complexity in order to abstract over this organizational complexity.
Specifically, it may make sense to have a two tier architecture, where the central catalog only accepts push data. Then the local teams would source data using a mix of push and pull (using the above guidelines), calculate deltas locally, then push deltas to the central catalog. This creates a very strong interface boundary between the central catalog and individual teams. However, it incurs quite a lot of overhead: each team needs to operate software to push metadata, there is much more surface area for problems to arise, etc. Thus the central data team needs to spend quite a bit of time educating teams, creating standards, helping QA teams' metadata, etc. If those fixed costs are spread amongst a very large team, it can be quite productive. However, for most companies, this approach is quite wasteful and can very likely cause the catalog to fail.