September 14, 2021

Defining Data Ownership

by
Mark Grover
Co-founder, CEO of Stemma

Photo by Jack Carter on Unsplash

In the first paragraph of a post I had written earlier this month, I referred to data engineers as producers of data. Someone immediately replied with something to the effect of, "You lost me at the first sentence. Data engineers can't be data owners."

Clearly, I ruffled some feathers, and I've done so plenty of times in the past whenever I've talked about ownership of data.

Let's say your company uses Segment to send events from its website to its data warehouse. Each Segment event, by default, creates a table in the warehouse. You, the data engineer, have created a derived view that aggregates all of these events into one table to make it much more accessible and consumable. A few days later, an analyst comes to you and says, "This view no longer has up-to-date data for any event." And then asks the dreaded question: "Who's the owner?"

In this case, the answer is somewhat obvious: the person who created this derived view — you, the data engineer.

But what if the question being asked was slightly different:

  • What action on the website triggers this event to happen? OR,
  • What does a particular item in the payload for place_order event mean?

It's obvious that data engineers can't be held responsible for answering such questions.

It turns out that who the data owner is depends on the question being asked.

Here are some common situations where someone could be asking the ownership question:

  • The data is stale. Who should triage and fix it?
  • The data is "wrong." Some fields are nulls or malformed.
  • What does this particular field mean?
  • Can this data be used for a new ML model that will show a different price to different people on the website?

Ultimately, ownership of data can be divided into 3 categories:

  1. Delivery owner — Ensures that this particular data gets delivered on time, per a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
  2. Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
  3. Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can't be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.
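
The three categories above can be captured as a simple, structured ownership record per table, with incoming questions routed by type. This is a minimal illustrative sketch — the class, field names, and email addresses are my own inventions, not Stemma's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TableOwnership:
    """Ownership record for one table or derived view (illustrative schema)."""
    table: str
    delivery_owner: str  # triages staleness and pipeline/SLA issues
    domain_owner: str    # explains field semantics and event triggers
    policy_owner: str    # approves new use-cases against data policy

def owner_for(question_type: str, record: TableOwnership) -> str:
    """Route an ownership question to the right owner based on its type."""
    routing = {
        "stale_data": record.delivery_owner,    # "Who should triage and fix it?"
        "field_meaning": record.domain_owner,   # "What does this field mean?"
        "new_use_case": record.policy_owner,    # "Can we use this for a new model?"
    }
    return routing[question_type]

orders = TableOwnership(
    table="analytics.order_events",
    delivery_owner="data-eng@example.com",
    domain_owner="checkout-team@example.com",
    policy_owner="privacy@example.com",
)
print(owner_for("field_meaning", orders))  # checkout-team@example.com
```

The point of the routing table is that "Who's the owner?" is underspecified on its own; the question type is what selects the owner.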

Aside — Quality owner: This may depend on organization, but I've found there to be a general resistance to take ownership of end-to-end quality of a data set. Why? Because data engineers don't consider themselves owners of the data that's produced from the upstream application and can't be held responsible for hunting down a website bug that impacts the data in the warehouse. The product engineers don't have enough context about how data gets joined and transformed downstream to own the final derived data artifact as their output. This may change as decentralized data management (aka data mesh) is deployed more broadly.

There's a tendency in some organizations to establish "shared ownership." In practice, this rarely works well. The only times I've seen it work are when the shared owner group has a common understanding of the different facets of ownership and can redirect each question to the right person within the group.

In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. An issue (patient) comes in, they triage to see what's going on. Sometimes it's a problem they can fix, so they fix it and resolve the issue (injury). In other cases, they redirect to the appropriate owner (a different health practitioner). As you can see, the issue (or question being asked) informs who the owner will be. 

Lastly, ownership does not need to be articulated at this level of granularity for all of your data. Data and organizations are ever-changing, and any information around ownership is bound to become stale. What's important is to have high-quality, granular ownership for the most important 20% of your data and good defaults for the remaining 80%.

Organizations typically use a data catalog to document ownership information. For the most important 20% of the data within your organization, document each facet of ownership explicitly:

[Image by Author: Metadata and related owners]

For the remaining 80% of data in your organization, rely on automation. Frequent users are a good proxy for domain owners. And the person who last changed the ETL code, or who gets alerted when the ETL job fails, is a good proxy for the delivery owner.

So, the next time someone asks you about ownership, ask yourself: Are you looking for the delivery, domain, or policy owner?

Want to learn more about Stemma’s fully managed data catalog? Reach out to Mark Grover or the team at Stemma.

Thanks to Chris Riccomini for reviewing a draft of this post.
