As a former data engineer, one thing I know all too well is that data is constantly changing. When you manage a data product you need to keep your data up to date to meet the needs of your consumers while optimizing for non-functional requirements (e.g. reusability, cost, performance). At a tactical level, this means updating the formats of your tables by adding or removing fields, adding or deprecating the specific values each field can hold, and changing the frequency and cadence at which your data is updated.
The interconnected nature of data products, with multiple products often chained together to build upon each other, means that even the smallest change in a pipeline can ripple downstream and materially affect your business. I’ve recently been using the term “Micro Migrations” to describe a scenario in which a change is made in a data product and all downstream assets are updated to take into account the new logic or process associated with the change. Micro Migrations differ from normal migrations in two ways:
- By definition they are, well, smaller. We’re talking about a small effort by a single person or team over one or two sprints, not a lift-and-shift for the entire data warehouse from Athena to Snowflake.
- They are frequent. These migrations occur often and are a typical part of life for an analyst or a data engineer who is actively maintaining their data products.
Here are a few Micro Migration use-cases that we see frequently at Stemma:
- An analytics engineer builds a table using dbt. That table becomes extremely popular, and several dashboards are built on top of it. As time passes it becomes well trusted, and the logic the analyst built becomes a source of truth for calculating certain metrics. However, the table is missing several important entity IDs (foreign key columns), which makes it difficult to join to other tables. The analytics engineer and a data engineer partner together to complete a micro migration to a new, more performant and reusable table.
- A machine-learning model is used to predict churn for a customer. Each day after the model runs, users who are predicted to churn are placed into a table in Snowflake and then sent an email to re-engage them. The product team is launching its second major product and wants to measure churn on a per-product basis. The engineering team updates its data pipeline to add a new column to the output table containing the product type that churn is being predicted for.
- An analyst creates several views to consolidate the common logic used to query tables in their dashboards. The views, however, perform poorly as the amount of data grows. The analyst and an engineer work together to create materialized versions and to update all of the dashboards using the prior views.
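The churn example above is typical: the change itself is tiny in code but consequential downstream. As a minimal sketch, assuming a pandas-based scoring step (the column names, the 30-day threshold, and the `predict_churn` helper are all hypothetical, not any team's actual pipeline), the micro migration amounts to tagging each output row with the product it was scored against:

```python
import pandas as pd

def predict_churn(users: pd.DataFrame, product_type: str) -> pd.DataFrame:
    """Return users predicted to churn; illustrative stand-in for a real model."""
    scored = users.copy()
    # Stand-in scoring rule: flag users inactive for more than 30 days.
    scored["churn_predicted"] = scored["days_since_last_login"] > 30
    # The micro migration: every output row now carries the product it was
    # scored against, so downstream consumers can measure churn per product.
    scored["product_type"] = product_type
    return scored[scored["churn_predicted"]]

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "days_since_last_login": [45, 2, 60],
})
at_risk = predict_churn(users, product_type="core")
print(at_risk[["user_id", "product_type"]])
```

The new column is backwards-compatible for readers that select columns by name, but any downstream job that assumed a single product (the re-engagement email sender, a churn dashboard) now needs to filter or group by `product_type`: exactly the ripple effect a micro migration has to account for.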
By far the biggest challenge with micro migrations that we hear about at Stemma is that it is hard to accurately curate a list of all assets that will be impacted by each change. If you cannot be certain of what will need to be changed or what could break, how can you provide risk analysis or even a time estimate?
Change management with data is hard. Data still lacks many of the primitives that are now deeply ingrained in software development:
- Building a development ecosystem that allows you to accurately test data products before launching them is expensive
- Data versioning is still a nascent idea (when compared to software package versions and version control systems)
- Data contracts cannot be described by their interface alone: the distribution, skew, and values contained within a dataset are just as important as its structure
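To make the last point concrete, here is a minimal sketch of a contract check that goes beyond the interface. The column names, expected plan values, and the 1% null-rate threshold are all hypothetical examples, not a reference to any particular contract tool:

```python
import pandas as pd

def check_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations for an illustrative table."""
    violations = []
    # Structural check: the interface a schema can express.
    expected_cols = {"user_id", "plan", "mrr"}
    missing = expected_cols - set(df.columns)
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
        return violations
    # Value checks: distribution and domain, which a schema alone cannot express.
    if df["mrr"].isna().mean() > 0.01:
        violations.append("mrr null rate above 1%")
    unexpected = set(df["plan"].dropna()) - {"free", "pro", "enterprise"}
    if unexpected:
        violations.append(f"unexpected plan values: {sorted(unexpected)}")
    return violations

sample = pd.DataFrame({
    "user_id": [1, 2],
    "plan": ["free", "legacy"],
    "mrr": [0.0, 10.0],
})
print(check_contract(sample))
```

A schema validator would pass this table; only the value-level check catches the `"legacy"` plan that no downstream consumer expects.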
While each of the problems above, and many more, is receiving in-depth attention from both open-source communities and proprietary teams, I believe a truly robust process for managing micro migrations must revolve around the data catalog. Your data catalog is the central repository for aggregating and collating the metadata about your data ecosystem.
I’ve worked with, interviewed, and learned from many dozens of teams, and there are three activities common to all of those that consistently navigate micro migrations with a high degree of success. Not surprisingly, the data catalog is the only tool with the information to repeatably and reliably enable these successful migrations.
- Consistently evaluate priority data assets, their structure, usage and future utilization
As a precursor to any micro migration, building an inventory of how your data is used will help you to take action on the right asset at the right time. Keeping an active tab on your data will prevent large, unwieldy ecosystems of intermingled and duplicated content. The catalog should be telling you who is using your assets, how they are being used, what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.
- Clearly define plans for both the identification and migration of assets
This should be obvious, but the challenge here is creating an efficient and comprehensive methodology. Your data catalog should have the broadest view of your data ecosystem and should understand who is using what, where data is moving, and when events occur. Just as you query your catalog to find important tables, you should be querying it with questions such as “What tables and dashboards will I need to update if I rename this view?”.
- Proactively communicate expected changes
It’s not enough to have your ducks in a row; everyone else needs to know what is going to change and when. Perhaps you remembered to inform the partner analyst who keeps the executive dashboard up to date about the impending change, but did you remember to tell the data engineer who builds data cubes off your table? This is another example of how your data catalog should aggregate metadata so as to simplify your work while increasing your confidence in the outcome.
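Answering a question like “what will I need to update if I rename this view?” is, under the hood, a walk of the catalog’s lineage graph. As a minimal sketch (the asset names and the `lineage` edge map are invented for illustration; a real catalog would expose this via its API), downstream impact analysis is a breadth-first traversal:

```python
from collections import deque

# Hypothetical lineage a catalog might expose: asset -> direct consumers.
lineage = {
    "analytics.users_view": ["analytics.users_mat", "dash.exec_overview"],
    "analytics.users_mat": ["dash.churn_by_product", "ml.churn_features"],
    "ml.churn_features": [],
    "dash.exec_overview": [],
    "dash.churn_by_product": [],
}

def impacted_assets(root: str) -> set[str]:
    """Breadth-first walk of downstream lineage: everything reachable from
    `root` would need attention during the micro migration."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

print(sorted(impacted_assets("analytics.users_view")))
```

The traversal surfaces transitive consumers too, such as the dashboard built on the materialized copy rather than on the view itself, which is precisely the asset that manual inventories tend to miss.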
Your data catalog should be more than a single pane of glass. Use the catalog as a central tool that has all of the information you need to curate your data roadmap and execute your micro migrations, hold team members accountable for creating micro migration execution plans, and leverage the catalog’s knowledge of who is using your data to keep all stakeholders involved.