A recently leaked document from Meta's privacy engineers gives clear insight into the need to capture and track metadata as data moves through systems and pipelines. The main goal, spurred by regulatory changes, is to apply a policy to data when it is created and to pass that policy, along with the data, to all downstream subsystems, so that consumers of the data can read the policy and determine whether a specific user’s information can be used.
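To make the pattern concrete, here is a minimal sketch of a record that carries its policy with it, so any downstream consumer can check the policy before using the data. The policy names and field names are illustrative assumptions, not Meta's or Stemma's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a policy is attached when the record is created
# and travels with the data, so any downstream consumer can check it.
# The policy tag "no-ads-personalization" is an illustrative assumption.

@dataclass
class Record:
    user_id: int
    payload: dict
    policies: frozenset = field(default_factory=frozenset)

def can_use_for(record: Record, purpose: str) -> bool:
    """A consumer checks the attached policy before using the data."""
    return f"no-{purpose}" not in record.policies

rec = Record(user_id=42, payload={"clicks": 7},
             policies=frozenset({"no-ads-personalization"}))

print(can_use_for(rec, "ads-personalization"))  # False
print(can_use_for(rec, "analytics"))            # True
```

The key design point is that the policy is part of the record itself, not a lookup in some distant system, so it cannot be silently dropped as the data moves between subsystems.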
While the sheer scale of Meta (350K APIs to update, some model features with 6K upstream tables!) certainly exacerbates the challenges, the fundamental problems articulated in the document will likely resonate with data leaders across all industries. In fact, a quote from the author(s) perfectly sums up this need:
If we can’t enumerate all the data we have - where it is; where it goes; how it’s used - then how can we make commitments about it?
I’m certain this is something you’ve thought about before. But is it even attainable? The use case I hear most often is: “At any point in time, can I accurately answer where my data is and how it is used?”
That is a far cry from the complexity Meta describes: filtering out user information at the row level for anyone who has opted out of personalized ads. But even so, getting to the point where we can answer basic questions about how data is being used must be a Herculean task, right? Meta defines several imperatives for achieving this goal, each of which has a direct parallel to how we approach the problem at Stemma. While I won’t go so far as to suggest you invest the same amount of time as Meta (they estimate 750 engineer-years), we can glean insight into how to apply the following two principles in a way that fits seamlessly into your data culture.
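As a rough illustration of the row-level case, here is a minimal sketch of filtering out opted-out users before data reaches an ads model. The field names (`user_id`, the opt-out set) are hypothetical, chosen only for the example:

```python
# Hypothetical sketch: drop rows for users who have opted out of
# personalized ads before the data is used downstream.
# The schema ("user_id", "clicks") is an illustrative assumption.

def filter_opted_out(rows: list[dict], opted_out_user_ids: set[int]) -> list[dict]:
    """Return only the rows whose user has not opted out."""
    return [row for row in rows if row["user_id"] not in opted_out_user_ids]

rows = [
    {"user_id": 1, "clicks": 12},
    {"user_id": 2, "clicks": 7},
    {"user_id": 3, "clicks": 4},
]
opted_out = {2}

print(filter_opted_out(rows, opted_out))
# [{'user_id': 1, 'clicks': 12}, {'user_id': 3, 'clicks': 4}]
```

The hard part at Meta's scale is not this filter itself but knowing, for every one of thousands of downstream tables and features, that the filter was applied, which is exactly the enumeration problem the quote above describes.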
1. The human factor: people are your data culture and no amount of automation can replace having high-touch human interactions in the right places. Applying policies to your upstream data assets is one of those critical spots, as seen here in Stemma:
2. Propagating the policy: data moves between systems and tables; it’s used in models, transformed in pipelines, and sent to vendors, among other things. This is where automation is your friend; your catalog should help you propagate your policies, ensuring the right downstream assets receive the right policies. Stemma uses existing lineage information to enable reporting across your assets; for example:
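Conceptually, this kind of propagation is a traversal of the lineage graph: each asset pushes its policy tags to everything downstream of it. The sketch below is an assumption about how such a traversal could look in general; the asset names and the union-of-policies merge rule are illustrative, not Stemma's actual implementation:

```python
from collections import deque

# Hypothetical sketch of policy propagation over a lineage graph.
# "lineage" maps each asset to the downstream assets built from it.

def propagate_policies(lineage: dict, source_policies: dict) -> dict:
    """Push each source asset's policy tags to everything downstream of it."""
    policies = {asset: set(tags) for asset, tags in source_policies.items()}
    queue = deque(source_policies)
    while queue:
        asset = queue.popleft()
        for downstream in lineage.get(asset, []):
            before = policies.setdefault(downstream, set())
            merged = before | policies[asset]
            if merged != before:
                # Downstream asset gained new tags: keep propagating from it.
                policies[downstream] = merged
                queue.append(downstream)
    return policies

lineage = {
    "raw.users": ["staging.users"],
    "staging.users": ["analytics.sessions", "ml.features"],
}
result = propagate_policies(lineage, {"raw.users": {"pii", "ads-restricted"}})
print(result["ml.features"])  # {'pii', 'ads-restricted'} (set order may vary)
```

Because an asset is only re-enqueued when its tag set actually changes, the traversal terminates even if the lineage graph contains cycles.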
To put it more simply: if you invest a little time to properly curate your policies, Stemma can propagate them downstream and help you report on them.