In the first paragraph of a post I had written earlier this month, I referred to data engineers as producers of data. Someone immediately replied with something to the effect of, "You lost me at the first sentence. Data Engineers can't be data owners."
Clearly, I ruffled some feathers, as I have plenty of times in the past whenever I talked about ownership of data.
Let's say your company uses Segment to send events from its website to its data warehouse. Each Segment event, by default, creates a table in the warehouse. You, the data engineer, have created a derived view that aggregates all of these events into one table to make it much more accessible and consumable. A few days later, an analyst comes to you and says, "This view no longer has up-to-date data for any event." And then asks the dreaded question: "Who's the owner?"
In this case, the answer is somewhat obvious: the person who created this derived view, i.e. you, the data engineer.
But what if the question being asked was slightly different:
- What action on the website triggers this event to happen? OR,
- What does a particular item in the payload for the place_order event mean?
It's obvious that data engineers can't be held responsible for answering such questions.
It turns out that who the data owner is depends on the question being asked.
Here are some common situations where someone could be asking the ownership question:
- The data is stale. Who should triage and fix it?
- The data is "wrong." Some fields are nulls or malformed.
- What does this particular field mean?
- Can this data be used for a new ML model that will show a different price to different people on the website?
Ultimately, ownership of data can be divided into 3 categories:
- Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
- Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
- Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can't be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.
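To make the three facets concrete, here is a minimal sketch of how a per-dataset ownership record might look. The class names, the dataset name, and the team names are all hypothetical and chosen only for illustration; a real data catalog would have its own schema.

```python
from dataclasses import dataclass
from enum import Enum


class OwnershipFacet(Enum):
    DELIVERY = "delivery"  # data arrives on time, within the promised SLA
    DOMAIN = "domain"      # meaning of fields and when events are triggered
    POLICY = "policy"      # permitted uses of the data


@dataclass(frozen=True)
class DatasetOwnership:
    dataset: str
    owners: dict  # maps OwnershipFacet -> owner or team name

    def owner_for(self, facet):
        # Facets without a recorded owner are reported as "unassigned".
        return self.owners.get(facet, "unassigned")


# Hypothetical dataset and team names, for illustration only.
events_view = DatasetOwnership(
    dataset="all_events_view",
    owners={
        OwnershipFacet.DELIVERY: "data-engineering",
        OwnershipFacet.DOMAIN: "checkout-product-team",
    },
)

print(events_view.owner_for(OwnershipFacet.DELIVERY))
print(events_view.owner_for(OwnershipFacet.POLICY))
```

Keeping the facets separate means a dataset can have a delivery owner recorded while the policy owner is still visibly missing, rather than hiding the gap behind a single "owner" field.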
Aside on the quality owner: This may depend on the organization, but I've found there to be a general resistance to taking ownership of the end-to-end quality of a data set. Why? Because data engineers don't consider themselves owners of the data that's produced by the upstream application and can't be held responsible for hunting down a website bug that impacts the data in the warehouse. The product engineers don't have enough context about how data gets joined and transformed downstream to own the final derived data artifact as their output. This may change as decentralized data management (aka data mesh) is deployed more broadly.
There's a tendency in some organizations to build "shared ownership." This fails to work well in practice. The only times I've seen it work well are when there's a shared understanding of the different facets of ownership within the shared owner group and the group is able to redirect to the right person within the group based on each question.
In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. An issue (patient) comes in, they triage to see what's going on. Sometimes it's a problem they can fix, so they fix it and resolve the issue (injury). In other cases, they redirect to the appropriate owner (a different health practitioner). As you can see, the issue (or question being asked) informs who the owner will be.
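The triage idea can be sketched as a small routing function. This is only an illustration: the question categories, the owner names, and the `route_question` helper are assumptions made for this example, not part of any real tool.

```python
# Hypothetical question categories mapped to the kind of owner who can answer them.
FACET_FOR_QUESTION = {
    "stale_data": "delivery",
    "field_meaning": "domain",
    "event_trigger": "domain",
    "permitted_use": "policy",
}


def route_question(question_type, owners, triage_owner="data-engineer-on-call"):
    """Return who should handle a question about a dataset.

    `owners` maps a facet name ("delivery", "domain", "policy") to a person
    or team. Anything that cannot be routed stays with the triage owner.
    """
    facet = FACET_FOR_QUESTION.get(question_type)
    if facet is None:
        # Unknown kind of issue: the triage owner investigates first.
        return triage_owner
    return owners.get(facet, triage_owner)


# Hypothetical owners for one dataset; no policy owner has been recorded.
owners = {"delivery": "data-engineering", "domain": "checkout-product-team"}

print(route_question("stale_data", owners))
print(route_question("field_meaning", owners))
print(route_question("permitted_use", owners))
```

Questions that don't match a known category, or whose facet has no recorded owner, fall back to the triage owner, which mirrors the nurse either treating the patient or deciding where to send them.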
Lastly, ownership does not need to be articulated at this level of granularity for all of your data. Data and organizations are ever-changing, and any information about ownership is bound to become stale. What's important is to have high-quality, granular ownership for the most important 20% of your data and good defaults for the remaining 80%.
Organizations typically use a data catalog to document ownership information. For the most important 20% of data within your organization:
[Figure: Metadata and related owners]
For the remaining 80% of data in your organization, rely on automation. Frequent users are a good proxy for domain owners, and the person who last changed the ETL code or gets alerted when the ETL job fails is a good proxy for the delivery owner.
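As a rough sketch of what those automated defaults might look like, the helpers below pick the most frequent user and the most recent ETL committer. The input shapes (a list of user names from a query log, and a list of `(timestamp, author)` pairs from version control) are assumptions for this example; real catalogs and warehouses expose this information in their own formats.

```python
from collections import Counter


def default_domain_owner(query_users):
    """Pick the most frequent user of a dataset as the proxy domain owner.

    `query_users` is an iterable of user names, one entry per query.
    Returns None if nobody has queried the dataset.
    """
    counts = Counter(query_users)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def default_delivery_owner(etl_commits):
    """Pick the author of the latest ETL change as the proxy delivery owner.

    `etl_commits` is an iterable of (timestamp, author) pairs.
    Returns None if there is no commit history.
    """
    commits = list(etl_commits)
    if not commits:
        return None
    return max(commits, key=lambda commit: commit[0])[1]


# Hypothetical usage data, for illustration only.
print(default_domain_owner(["ana", "bo", "ana", "cy"]))
print(default_delivery_owner([(1, "dev-a"), (3, "dev-b"), (2, "dev-c")]))
```

These are only defaults: they give the person asking somewhere to start, and whoever is reached can redirect if the proxy turns out to be wrong.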
So, the next time someone asks you about ownership, ask yourself: Are you looking for the delivery, domain, or policy owner?
Thanks to Chris Riccomini for reviewing a draft of this post.