A little while ago, Mark sat down with Jared, Rich, and Arkady on the Intricity101 podcast to talk about Stemma and why it is becoming a critical tool for modern data teams. You can listen to the podcast on Spotify here or watch it on YouTube here. I won’t attempt to cover the entire discussion but rather highlight the points that explain why Stemma is drawing so much attention now. In short, the team at Stemma has long been active in the current data revolution, building tools that help more companies democratize their data.
Data for Tech Innovators
The current data boom began with tech unicorns winning their markets by making big bets on data. From the outside it looked like magic: technology meshed with customer needs to create products that became indispensable. Under the hood, however, data teams at these companies were scrambling to get insights from the applications back to the product owners so they could keep customers engaged.
The primary challenge these teams faced was the ever-changing nature of data in a modern application environment. There are three factors that are always in motion at a tech innovator:
- The data sources - change is driven by strategic actions outside the data team’s control as silos are broken down, companies are acquired, and new product lines or divisions are launched
- The people - it’s the nature of high tech that talent comes and goes, with new hires needing help orienting themselves and departing talent taking tribal knowledge with them
- The pipelines - migration is a fact of life as teams grow and require new systems or old systems age out
Keeping track of data in this dynamic environment was a huge tax on productivity at these small companies. The standard practice at the time was to enter curated documentation into a legacy data catalog. But those catalogs were designed for slower, bigger teams that could afford the time and manpower manual curation requires.
For modern teams these tools were simply unusable. They could not afford to staff full-time data stewards, an approach that pulls skilled analysts off the front line of insight gathering to serve instead in passive, on-call support roles. The menial nature of the work also made it difficult to recruit part-time volunteers. These data teams were already punching above their weight class in their industries, but analysts found the work before them growing as success drew new competitors looking to copy their business models.
It was in this environment that Mark created the Amundsen project at Lyft to help the data team solve two key use cases: data discovery and lineage-based change management. In the first case, data analysts were awash in data as redundant or iterative models ran alongside each other; simply finding the right data to work with took 25-30% of their time. In the second case, data engineers somehow had to anticipate the impact of pipeline changes on users, and the best option available to them was broadcasting Slack notifications that were quickly ignored.
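To make the lineage use case concrete, here is a minimal, hypothetical sketch (toy table names and a toy graph, not Amundsen’s actual API or data model) of how lineage metadata turns a proposed pipeline change into a precise list of affected downstream assets, rather than a broad announcement:

```python
from collections import deque

# Toy lineage graph for illustration only; in practice a catalog ingests this
# metadata automatically from query logs, dbt manifests, or pipeline definitions.
lineage = {
    "raw.rides": ["core.rides_cleaned"],
    "core.rides_cleaned": ["mart.driver_hours", "dashboards.weekly_rides"],
    "mart.driver_hours": ["ml.driver_incentive_features"],
}

def downstream_impact(table: str) -> set[str]:
    """Breadth-first walk of the lineage graph to find every downstream asset."""
    impacted, queue = set(), deque([table])
    while queue:
        for consumer in lineage.get(queue.popleft(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

# Before altering raw.rides, an engineer can notify exactly these asset owners
# instead of posting a broad Slack announcement that is easy to ignore.
print(sorted(downstream_impact("raw.rides")))
```

With lineage in place, change management becomes a targeted conversation with the owners of the impacted tables, dashboards, and features, which is the behavior the Amundsen use case is aiming at.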
Mark and his team found enormous success by focusing on three principles: automate the tedious work of data stewardship, build a product that provides deep insight into how data is actually used, and integrate with users’ workflows wherever possible rather than trying to overwrite them. The result was rapid mass adoption: the platform reached 80% adoption at Lyft and became an open-source project used by over 35 companies, including Brex, DoorDash, ING, Instacart, and Snap, and kept active by more than two thousand contributors.
Data for All Domain Experts
The story doesn’t end there, however. As marketplace winners continue to be determined by their ability to leverage insights from data, there is pressure across industries to bring all domain experts into the analytics process. In the time since Amundsen went open source, the modern data stack has evolved to meet this need. Data consumption was already being democratized as a new wave of business intelligence tools, like Tableau, gained mainstream adoption among business users. Now data production is going through the same shift: low-code ETL tools like dbt let non-engineers become creators, making the generation of data accessible to a much broader group.
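As a hedged illustration of what “non-engineers as creators” looks like in practice: dbt models are usually plain SQL files, and newer dbt versions (1.3+) also accept Python models on warehouses such as Snowflake, where `dbt.ref()` hands back a dataframe the author can transform directly. The sketch below uses hypothetical model and column names and assumes the Snowflake adapter; it is not a definitive recipe.

```python
# Hypothetical dbt Python model (e.g. models/daily_revenue.py), assuming
# dbt-core >= 1.3 on the Snowflake adapter, where dbt.ref() returns a
# Snowpark DataFrame. Column names are illustrative.
def model(dbt, session):
    # ref() resolves the upstream model through dbt's dependency graph, so this
    # model is automatically ordered after "stg_orders" with no hand-built pipeline.
    orders = dbt.ref("stg_orders").to_pandas()

    # Business logic an analyst can express without touching orchestration code.
    daily_revenue = orders.groupby("order_date", as_index=False)["amount"].sum()

    # dbt materializes the returned DataFrame as a table in the warehouse.
    return daily_revenue
```

The point is less the specific syntax than the division of labor: the analyst writes the transformation, while dependency ordering, materialization, and documentation are handled by the tool.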
As with so many IT transformations of the cloud era, the new operational capabilities of the modern data stack bring complexity that falls on an internal team to resolve. In this case it is the data team, made up of engineers and analysts, who are already busy managing the three forces of change mentioned earlier. Now this team must figure out how to keep supporting the generation of insights even as critical processes are taken out of their hands and passed to less technical users. The next major cultural change on the path to true data democratization is to remove the data team as a bottleneck to insights without taking away so much control that they can no longer be responsible for the delivery and accuracy of data.
Stemma is the solution for this new transition. Data catalogs have always been about making data accessible to users, and the team at Stemma has always been focused on building catalogs that modern teams find effective and usable.