Lyft has grown significantly over the last five years, both in the volume of data it stores and processes and in the number of employees who use data day to day to make data-driven decisions.
This rapid growth created an acute challenge for data consumers (Data Scientists) and data producers (Data Engineers and Product Engineers). At Lyft, some Data Scientists focus on decision science, helping shape critical business decisions, while others focus on building the algorithms that power Lyft's internal and external products. The Lyft team undertook an internal user research project to understand and prioritize the pain points in the most common Data Scientist workflows. This research found that the biggest pain point in a Data Scientist's workflow was the amount of time spent finding, validating and exploring trustworthy data.
Lyft Data Scientists spent over 33% of their time discovering data and establishing trust in it. These questions included¹:
Lyft evaluated a few commercial and open-source solutions for data discovery and data cataloging. However, most existing solutions focused heavily on curating metadata and were not scalable or flexible enough to incorporate different kinds of data sources or the diversity of use cases.
Lyft created Amundsen to be able to:
In addition to helping users establish trust in the data, Lyft was able to use the same solution for its data warehouse migration. Data Engineers were able to understand who uses the data and where it comes from, while Data Scientists and other data users were able to see which data to use from which warehouse during the migration.
Users also valued being able to preview data; see column statistics, frequent users and the last-updated date for a table; see the source code and DAG that generated the table; and view and edit the descriptions and tags associated with the data.
Lastly, Amundsen helped begin the data mesh implementation within the company. It enabled producers and consumers of data to interface directly with each other instead of relying on the data team to play telephone. Amundsen plugs into the existing workflows of Data Producers (Product Engineers for source data and Data Engineers for derived data) to pull in and publish rich metadata about the data being produced. Additionally, data consumers (Data Scientists and Ops users) can request additional metadata and report issues with the data from within Amundsen, which then files a JIRA ticket and notifies the data producers.
It has 750 users, and 75% of Data Scientists, Analysts and Data Engineers use it every week.
Amundsen has consistently received high CSAT scores from data users. To this day, it remains the Data & ML product with the highest CSAT score at Lyft.