I recently joined an expert panel to discuss data discovery and how it relates to data mesh, moderated by Paco Nathan and co-organized by Data Mesh Learning and Open||Source||Data.
- Why is data discovery important?
- What is the role for data discovery in data mesh?
- Who’s responsible for making data discoverable?
- Does there need to be a promotion process to ensure data is accurate and trustworthy?
- When does data discovery reach its limit when handling PII?
Check out the video above to hear insights on these questions (and more!). The panel discussion is also summarized below.
Why is data discovery important? Why should we be looking at this specific area?
Finding and understanding data
Data access comes first. But once you can access data, you need to understand what exists and where it lives before you can start querying, analyzing, or making decisions based on it. That is data discovery -- finding and understanding data. It is a fundamental capability for data professionals across industry, whether you focus on analysis, modeling, or manipulation of data.
With the rise of the modern data stack, more companies are able to collect more and more data. Over the last 5 to 10 years, we've done a lot of innovation in getting data to a centralized place. We've invested a lot of energy and infrastructure into centrally storing the data and then using tools like Tableau to do analytics, and in this way we have democratized access to it. And now we're getting data not just from applications and websites, but from SaaS tools as well.
“People are hungry for data and now we have lots of it. Companies want to connect this data in one place, because people don’t know what data exists, why it exists, how it’s being used, where it is, if they can’t trust it, etc.”
Wider audience of data users
Data discovery is also emerging because it’s not just centralized data teams that are using the data, but other teams in your company, too, including product managers, ops analysts, sales analysts, engineers, etc. There is now a wider audience that’s searching for and using this data.
Hyperspecialization of tools
A new problem is now emerging -- there are too many tools. We don't just have data in too many places; we have tools in too many places, too. Hyperspecialization has created a situation where people don't know where things are located. If you need to replace a certain tool, what will be impacted? Data mesh will accelerate this even further as domains gain the autonomy to choose their entire stack. This rise in hyperspecialization makes it even more important to focus on data discovery, so that we understand where everything is coming from and where everything is going.
Ethics and responsibility
Using data in a sensible and ethical way is also extremely top of mind as an increasing number of personas and professionals start using it for decision making. With a wider audience, not everyone can be assumed to have a formal background in understanding data and responsible ways of using it. Data discovery will enable people to find data more easily, understand it in a sensible way, and use it in a responsible manner.
ETL vs. ELT—can someone break those down?
The old-school philosophy focused on transforming a lot of data in flight before loading it into a warehouse. What's been emerging over the last couple of years is an emphasis on collecting all of the data first and transforming it later. These days, storage is cheap and data lakes are plentiful. Don't throw away data in flight; copy it all and transform it later. ELT has led to simpler and more operable pipelines.
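The load-first, transform-later pattern can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in warehouse; the table and column names are made up for the example.

```python
import sqlite3

# ELT sketch: load raw records as-is first, transform later inside the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount REAL, currency TEXT)")

# Load: copy everything with no in-flight transformation.
raw = [("u1", 10.0, "USD"), ("u2", 8.5, "EUR"), ("u1", 3.0, "USD")]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw)

# Transform: derive a cleaned, aggregated table afterwards, in SQL.
conn.execute("""
    CREATE TABLE spend_by_user AS
    SELECT user_id, currency, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id, currency
""")
totals = {
    f"{u}/{c}": t
    for u, c, t in conn.execute("SELECT user_id, currency, total FROM spend_by_user")
}
print(totals)  # {'u1/USD': 13.0, 'u2/EUR': 8.5}
```

Because the raw table is kept, a bad transformation can simply be rerun later -- nothing was discarded in flight.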
Reverse ETL, especially with respect to machine learning. Can someone explain this?
It’s a relatively new category—major players like Census and Hightouch are focused on it. The basic premise is to take transformed data in your warehouse and stick it back into the systems that business users end up using. For example, let’s say a customer success manager is always in Salesforce. They can use reverse ETL to decide which customers to support that day based on a model that’s run in their Snowflake warehouse.
“There is a deep need in taking learnings from a centralized warehouse and putting them back into the hands and workflow of users who are making certain decisions.”
A core tenet of what we should be doing in the data space, especially with the fragmentation of various data tools, is taking insight and putting it into the flow of the user.
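The Salesforce example above can be sketched as a tiny reverse ETL loop. The `warehouse_scores` and `crm` structures below are stand-ins for a real warehouse query result and CRM API, not actual Census, Hightouch, or Salesforce interfaces.

```python
# Reverse ETL sketch: read model output from the warehouse and sync it into
# the tool where the business user already works (a CRM, here faked as a dict).

warehouse_scores = [  # output of a churn model run in the warehouse
    {"account_id": "acct-1", "churn_risk": 0.82},
    {"account_id": "acct-2", "churn_risk": 0.11},
]

crm = {  # CRM records the customer success manager actually looks at
    "acct-1": {"name": "Acme", "churn_risk": None},
    "acct-2": {"name": "Globex", "churn_risk": None},
}

def reverse_etl(scores, crm_records):
    """Write warehouse-derived fields back onto the operational records."""
    for row in scores:
        record = crm_records.get(row["account_id"])
        if record is not None:
            record["churn_risk"] = row["churn_risk"]
    return crm_records

reverse_etl(warehouse_scores, crm)

# The CSM now sees churn risk directly in their CRM view:
at_risk = [r["name"] for r in crm.values() if r["churn_risk"] and r["churn_risk"] > 0.5]
print(at_risk)  # ['Acme']
```

The insight (the churn score) lands in the user's existing workflow rather than requiring them to open a BI tool.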
What’s the role for data discovery in data mesh?
Data mesh explained
Data mesh is all about splitting up your centralized data ecosystem into data domains and then applying really high quality data product thinking around how data is shared across those domains. Data discovery is a capability for enabling the metadata plane or control plane on the data mesh.
Central metadata fabric
One of the things that data discovery has initially attempted to do is scrape and understand log files in order to stitch together a picture of what you have. What’s really exciting with data mesh is the ability to have data products push metadata out from data domains into the central metadata fabric. This leads to much higher data quality experiences.
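The push model can be sketched as domains publishing their own metadata into a shared catalog, rather than the catalog scraping logs. The `CentralCatalog` class and its methods are hypothetical names for illustration.

```python
# Sketch of data products pushing metadata outward into a central fabric.

class CentralCatalog:
    """A minimal central metadata fabric that domains publish into."""

    def __init__(self):
        self.entries = {}

    def publish(self, domain, dataset, metadata):
        # Each data product pushes its own metadata; the catalog never scrapes.
        self.entries[f"{domain}.{dataset}"] = {"domain": domain, **metadata}

    def find(self, keyword):
        return [
            name for name, meta in self.entries.items()
            if keyword in name or keyword in meta.get("description", "")
        ]

catalog = CentralCatalog()
catalog.publish("payments", "transactions",
                {"description": "settled card transactions", "owner": "payments-team"})
catalog.publish("growth", "signups",
                {"description": "daily signup events", "owner": "growth-team"})

print(catalog.find("transactions"))  # ['payments.transactions']
```

Because each domain owns what it publishes, the metadata arrives with the context only the producer has, which is where the higher-quality experience comes from.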
The way that we model, consume, and produce data is decoupled. Different domain owners should own and manage their own models. However, with decoupling comes questions like, “If I want to access data or a service that I didn't create but want to utilize, how will I find out or understand how to use that data or service without having to build it again myself?”
Data discovery allows you to maintain a source of truth while still having decoupled ownership. That’s the beauty of data mesh.
Data discovery for code
Most data discovery is built for humans, but when you start implementing for compliance, you have to start thinking about data discovery for code. There is a real need for programmatic abstractions. For example, given a model ID, where does the model live? You need a way to find the model and determine its compliance tags so that you can apply the right policies before rendering. Applying these programmatic use cases is our next challenge.
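The model-ID example might look like the sketch below: code, not a human, resolves the asset and its compliance tags, then applies a policy before anything is rendered. The registry, tag names, and masking rule are all hypothetical.

```python
# Sketch of data discovery for code: resolve a model ID programmatically,
# read its compliance tags, and apply policy before rendering.

MODEL_REGISTRY = {
    "reco-model-7": {
        "location": "s3://models/reco-model-7",  # illustrative path
        "compliance_tags": {"pii", "gdpr"},
    },
}

def render_model_output(model_id, payload):
    entry = MODEL_REGISTRY[model_id]  # programmatic lookup, no human in the loop
    if "pii" in entry["compliance_tags"]:
        # Policy: mask fields flagged as personal data before rendering.
        payload = {k: ("***" if k in {"email", "name"} else v)
                   for k, v in payload.items()}
    return payload

result = render_model_output("reco-model-7", {"email": "a@b.com", "score": 0.9})
print(result)  # {'email': '***', 'score': 0.9}
```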
Who’s responsible for making data discoverable—people using the data or people creating the data?
The onus should be on everyone. Think about it like open source communities for a new project. People contribute and put time and love and effort into it because they’re getting something out of it eventually. If we say it’s someone’s specific role to tag, it won’t happen. But when you see the benefit of using it as part of a wider community and can see what it brings, it gains momentum.
One of the reasons data catalogs and data discovery platforms fail is that they don't have enough documentation to actually deliver value. The best time to capture documentation is when it's in the hands of the person who has it. Say someone is creating a new event: you should capture that information as they create the event, and enforce checks and balances right there. This is a great way to ensure a description trickles down. A good chunk of documentation can be derived from the processes you already have in place. If you can capture documentation in the flow of the user producing the data, you have a clear path to ensuring documentation actually happens.
Data consumer and producer
Both the consumer and producer sides have a responsibility for ensuring documentation. One emerging solution is automation -- pulling out existing operational metadata to augment the discovery experience, so users can still find what they're looking for even when it lacks full documentation. If you use a popularity approach based on queries, these insights can help foster better documentation, since you'll know which data is most commonly used.
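A query-based popularity signal can be derived from query history alone. The sketch below uses a deliberately naive regex over `FROM`/`JOIN` clauses rather than a real SQL parser, and the query log is made up.

```python
from collections import Counter
import re

# Sketch of a popularity signal: count table references in query history,
# then use the ranking to prioritize documentation effort.

query_log = [
    "SELECT * FROM orders JOIN customers ON orders.cust_id = customers.id",
    "SELECT count(*) FROM orders",
    "SELECT * FROM inventory",
]

def table_popularity(queries):
    counts = Counter()
    for q in queries:
        # Naive matcher: grabs the identifier after FROM or JOIN.
        counts.update(re.findall(r"(?:FROM|JOIN)\s+(\w+)", q, re.IGNORECASE))
    return counts

popularity = table_popularity(query_log)
print(popularity.most_common(1))  # [('orders', 2)]
```

The most-queried tables are the ones where missing documentation hurts the most users, so they get documented first.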
Does there need to be a promotion process to ensure that data is accurate and trustworthy enough?
We've seen capabilities where folks check schemas with compliance tools and use GitHub Actions to check against a metadata service -- for example, whether a particular commit is valid based on rules defined in that service. A dataset can't be promoted if it doesn't fit the rules.
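Such a promotion gate might look like the check below: a commit's proposed schema is validated against rules fetched from a metadata service before the dataset can be promoted. The rule set and schema are illustrative.

```python
# Sketch of a CI-style promotion gate driven by metadata-service rules.

RULES = {  # what a metadata service might return for this dataset
    "required_columns": {"id", "created_at"},
    "forbidden_columns": {"ssn"},
}

def can_promote(schema_columns, rules):
    """Return True only if the proposed schema satisfies the rules."""
    columns = set(schema_columns)
    if not rules["required_columns"] <= columns:
        return False  # a required column is missing
    if rules["forbidden_columns"] & columns:
        return False  # a forbidden column is present
    return True

print(can_promote(["id", "created_at", "amount"], RULES))  # True
print(can_promote(["id", "amount", "ssn"], RULES))         # False -- blocked
```

In a CI workflow, a `False` here would fail the build, so an invalid commit never promotes the dataset.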
We have a notion of status tags that customers use to mark which data sets are certified, so that a new analyst, or anyone looking for the first time, immediately knows which data sets are proven, tried and true, and the ones everyone is using. Customers can also mark tables as gold, silver, bronze, or raw.
Tagging and automation challenges
However, tagging brings an interesting automation challenge. What makes a data set “gold”? Someone might stamp a data set as “gold” today because of a certain policy, but what if that policy changes next week? Will the data set still deserve the stamp? If we can encode and automate these policies with tagging, we could have a more vibrant, manageable tagging system that isn't reliant on humans but driven by the metadata and policies associated with each data set.
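Encoding the policy makes this concrete: the tag is recomputed from the current policy rather than stamped once, so a policy change automatically changes the tag. The thresholds and field names below are invented for the sketch.

```python
# Sketch of policy-driven status tags: the tag is a function of the current
# policy, not a one-time human stamp.

def status_tag(dataset, policy):
    if (dataset["doc_coverage"] >= policy["min_doc_coverage"]
            and dataset["freshness_hours"] <= policy["max_freshness_hours"]):
        return "gold"
    return "bronze"

dataset = {"doc_coverage": 0.9, "freshness_hours": 12}

policy_v1 = {"min_doc_coverage": 0.8, "max_freshness_hours": 24}
policy_v2 = {"min_doc_coverage": 0.8, "max_freshness_hours": 6}  # stricter next week

print(status_tag(dataset, policy_v1))  # gold
print(status_tag(dataset, policy_v2))  # bronze -- same data, new policy, new tag
```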
Challenges with promoting data sets
Another core issue lies in the general notion of promoting data sets altogether. If multiple groups of people are using the same data sets in the same way, then we have a problem. You want data to be used in new and interesting ways.
When does data discovery reach its limit when handling PII and how can we try to consume PII in a secure way? What are the limits for data discovery?
Ability to discover all data
One approach that has worked at modern organizations is to make all data discoverable so people know it exists, but to mark sensitive data as such. Due to its sensitivity, you won't see richer metadata around it: no queries being run on it, no profiling of the data, no preview. But you'll be able to discover all data, whether it's sensitive or not. Accessibility and the understanding of all data are democratized, but richer metadata is blocked for sensitive data unless you're granted access to it.
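This "discoverable but not exposed" split can be sketched as a catalog lookup that withholds the richer metadata until the viewer has access. The catalog entries and field names are illustrative.

```python
# Sketch of sensitivity-aware discovery: every dataset is findable by name,
# but previews and query history are hidden without access.

CATALOG = {
    "users_pii": {
        "sensitive": True,
        "preview": [("u1", "a@b.com")],
        "recent_queries": ["SELECT email FROM users_pii"],
    },
    "page_views": {
        "sensitive": False,
        "preview": [("u1", "/home")],
        "recent_queries": ["SELECT * FROM page_views"],
    },
}

def describe(name, has_access):
    meta = CATALOG[name]
    if meta["sensitive"] and not has_access:
        # Discoverable: only existence and the sensitivity marker are shown.
        return {"name": name, "sensitive": True}
    return {"name": name, "sensitive": meta["sensitive"],
            "preview": meta["preview"], "recent_queries": meta["recent_queries"]}

print(describe("users_pii", has_access=False))
# {'name': 'users_pii', 'sensitive': True}
```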
Mirroring data infrastructure
Metadata infrastructure should mirror data infrastructure. Just as with data, we have RBAC and ABAC and the ability to anonymize based on access; if you have access to the data, you should have some kind of access to its metadata. We work with companies where even table names are sensitive -- one domain doesn't want another domain to know a table exists. Designing for these scenarios based on access privileges, you may not see the name of a table but may still be able to see its tags. Mirror the capabilities that data systems have evolved over the years and apply the same things to metadata systems.
Integrating with existing RBAC policies
One thing we'll be working on is integrating with existing RBAC policies -- replicating the privileges users have in their database or BI tools so that they have the same access in their data discovery platform. For example, if someone has specific access privileges in Snowflake or BigQuery, those same privileges should be replicated in the discovery platform.
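At its core, replicating RBAC means the catalog filters what it shows by the grants the user already holds in the warehouse. The grant map below stands in for privileges read out of Snowflake or BigQuery; the roles and table names are made up.

```python
# Sketch of mirroring warehouse RBAC in the discovery layer: a user only
# discovers tables their warehouse role can already see.

WAREHOUSE_GRANTS = {  # stand-in for grants synced from Snowflake/BigQuery
    "analyst": {"sales.orders", "sales.customers"},
    "finance": {"finance.ledger", "sales.orders"},
}

ALL_TABLES = ["sales.orders", "sales.customers", "finance.ledger"]

def discoverable_tables(role):
    granted = WAREHOUSE_GRANTS.get(role, set())
    return [t for t in ALL_TABLES if t in granted]

print(discoverable_tables("analyst"))  # ['sales.orders', 'sales.customers']
print(discoverable_tables("intern"))   # [] -- no warehouse grants, nothing shown
```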
In terms of PII, as a metadata platform we don't access the data itself, but by default we do access query history, which means we get exposed to a query whenever someone touches a sensitive column. We want to enable customers to define and tag sensitive data so that it can be removed before the query history is written to disk.
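Scrubbing captured queries before persistence might look like the sketch below: columns the customer has tagged as sensitive are redacted out of the SQL text. The tag list is hypothetical, and the regex approach is a simplification rather than a real SQL parser.

```python
import re

# Sketch of redacting customer-tagged sensitive columns from query history
# before it is written to disk.

SENSITIVE_COLUMNS = {"email", "ssn"}  # columns the customer tagged as sensitive

def redact_query(sql):
    for col in SENSITIVE_COLUMNS:
        sql = re.sub(rf"\b{col}\b", "<redacted>", sql, flags=re.IGNORECASE)
    return sql

print(redact_query("SELECT email, plan FROM users WHERE ssn = '123'"))
# SELECT <redacted>, plan FROM users WHERE <redacted> = '123'
```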