May 25, 2022 - 4 min read

A Lesson in Data Policies from Facebook

by Grant Seward, Founding Engineer

A recently leaked document from Meta privacy engineers gives clear insight into the need to capture and track metadata as data moves through systems and pipelines. The main goal, spurred by regulatory changes, is to apply a policy to data when it is created and to pass that policy, along with the data, to every downstream subsystem, so that consumers of the data can read the policy and determine whether a specific user’s information can be used.
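To make that concrete, here’s a minimal sketch of what “the policy travels with the data” could look like in Python. Everything here (PurposePolicy, produce_record, the purpose names) is a hypothetical illustration of the pattern, not Meta’s or Stemma’s actual implementation:

```python
from dataclasses import dataclass

# Hypothetical purposes a user may or may not have consented to.
ADS = "personalized_ads"
ANALYTICS = "analytics"

@dataclass(frozen=True)
class PurposePolicy:
    """Policy attached to data at creation time and carried downstream."""
    allowed_purposes: frozenset

    def allows(self, purpose: str) -> bool:
        return purpose in self.allowed_purposes

@dataclass
class Record:
    user_id: str
    payload: dict
    policy: PurposePolicy  # travels with the data through every subsystem

def produce_record(user_id: str, payload: dict, opted_out_of_ads: bool) -> Record:
    # The policy is decided once, at creation, based on the user's consent.
    purposes = {ANALYTICS}
    if not opted_out_of_ads:
        purposes.add(ADS)
    return Record(user_id, payload, PurposePolicy(frozenset(purposes)))

# A downstream consumer checks the policy instead of guessing:
def rows_usable_for_ads(records):
    return [r for r in records if r.policy.allows(ADS)]
```

The point of the pattern is that the consent decision is made exactly once, at the source, and every consumer enforces it mechanically rather than re-deriving it.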

While the sheer scale of Meta (350K APIs to update, some model features with 6k upstream tables!) certainly exacerbates the challenges, the fundamental problems articulated in their document will likely resonate with data leaders across all industries. In fact, a quote from the author(s) perfectly sums up this need:

If we can’t enumerate all the data we have - where it is; where it goes; how it’s used - then how can we make commitments about it?

I’m certain this is something you’ve thought about before. But is it even attainable? The use case I hear most often is: “At any point in time, can I accurately answer where my data is and how it is used?”

This is a far cry from the complexity Meta describes: filtering out user information at the row level for anyone who has opted out of personalized ads. But even so, getting to the point where we can answer basic questions about how data is being used must be a Herculean task, right? Meta defines several imperatives for achieving this goal, each of which has a direct parallel in how we approach the problem at Stemma. While I won’t go so far as to suggest you invest the same amount of time as Meta (they estimate 750 engineering years), we can glean insight into how to apply the following two principles in a way that fits seamlessly into your data culture.

1. The human factor: people are your data culture, and no amount of automation can replace high-touch human interactions in the right places. Applying policies to your upstream data assets is one of those critical spots, and it is exactly where Stemma asks for human curation.
2. Propagating the policy: data moves between systems and tables; it’s used in models, transformed in pipelines, and sent to vendors, among other things. This is where automation is your friend: your catalog should help you propagate your policies, ensuring that the right downstream assets receive the right policies. Stemma uses existing lineage information to enable reporting across your assets; the sketch below shows the idea.
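To illustrate why lineage makes this automatable, here’s a hedged sketch in Python: given a graph of downstream edges, a policy curated on one upstream table is propagated to everything derived from it. The lineage dictionary, asset names, and propagate_policy function are assumptions for illustration, not Stemma’s API:

```python
from collections import deque

# Hypothetical lineage: upstream asset -> assets derived from it.
lineage = {
    "raw.users": ["staging.users_clean"],
    "staging.users_clean": ["analytics.user_features", "exports.vendor_feed"],
    "analytics.user_features": ["ml.churn_model_input"],
}

def propagate_policy(root: str, policy: str) -> dict:
    """Walk the lineage graph breadth-first, tagging every downstream asset."""
    tagged = {root: policy}
    queue = deque([root])
    while queue:
        asset = queue.popleft()
        for downstream in lineage.get(asset, []):
            if downstream not in tagged:  # avoid re-visiting shared paths
                tagged[downstream] = policy
                queue.append(downstream)
    return tagged

# A human curates one policy at the source; automation handles the rest.
print(propagate_policy("raw.users", "contains_pii"))
# {'raw.users': 'contains_pii', 'staging.users_clean': 'contains_pii', ...}
```

One manual tag at the root is enough to answer “where did this data go?” across every table, model, and vendor feed that derives from it.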

To put it more simply: if you invest a little time to properly curate your policies, Stemma can propagate them downstream and help you report on them.

