November 9, 2022
26 April 2021

Lyft increases data scientist & analyst productivity by 20% with Amundsen

Amundsen used by 750 users, with 75% of Data Scientists, Analysts and Data Engineers using it every week.

Data Stack

Data Catalog
Amundsen
ETL
Apache Airflow
Data Warehouse
Presto
Business Intelligence
Mode, Apache Superset

Key Takeaways

  • Before Amundsen, Lyft Data Scientists spent over 33% of their time finding, understanding, and establishing trust in data.
  • Amundsen integrates with Lyft's data warehouse, ETL engine, and BI tools to power programmatic use-cases and a user experience that enables data users to discover, understand, and trust the data they need.
  • Amundsen is widely adopted within Lyft with 750 users and a penetration of over 75% of data scientists, analysts and data engineers.
  • Amundsen has increased data engineering and data science productivity by over 20%.

The Challenge

Due to high growth in data and employee base, Lyft Data Scientists were spending over 33% of their time finding, understanding, and establishing trust in data.

Lyft has grown significantly over the last five years, both in the volume of data stored and processed and in the number of employees who use data day-to-day to make data-driven decisions.

This rapid growth led to an acute challenge for data consumers (Data Scientists) and data producers (Data Engineers and Product Engineers). At Lyft, some Data Scientists focus on decision science - helping shape critical business decisions, while others focus on building algorithms that power Lyft's internal and external products. The Lyft team undertook an internal user research project to understand and prioritize the most common pain points in the most common workflows of Data Scientists. This research found that the biggest pain point in Data Scientists' workflows was the amount of time they spent finding, validating, and exploring trustworthy data.

Lyft Data Scientists spent over 33% of their time discovering and trusting data, which meant answering questions such as¹:

  • Does this data exist? Where is it? What is the source of truth of that data? Do I have access to it?
  • Who and/or which team is the owner? Who are the common users?
  • Is there existing work I can re-use?
  • Can I trust this data?

The Solution

Amundsen captured metadata from existing data systems to power programmatic use-cases and a user experience for data users to discover, understand and trust the data they needed.

Lyft evaluated a few commercial and open-source solutions for data discovery and data cataloging. However, most existing solutions were heavily focused on manually curating metadata and weren't scalable or flexible enough to incorporate different kinds of data sources or the diversity of use-cases.

Lyft created Amundsen to be able to:

  • Capture metadata from different data sources - data warehouses, BI/dashboarding systems, ETL tools, teams/HRIS systems
  • Build a model of what was trustworthy based on when, what and how data is produced or consumed
  • Expose this metadata via a catalog UI for users to discover, understand and trust data, as well as via an API for programmatic use-cases.
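The three goals above can be illustrated with a small sketch. This is not Amundsen's actual data model or scoring logic; the `TableMetadata` record, the `trust_score` heuristic, its equal weights, and the table names are all hypothetical, chosen only to show how metadata from several systems might be combined into a single trust signal.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TableMetadata:
    """Hypothetical record merging metadata from several source systems."""
    name: str
    last_updated: datetime                               # e.g. from the ETL tool
    weekly_readers: int                                  # e.g. from warehouse query logs
    dashboards: List[str] = field(default_factory=list)  # e.g. from BI tools
    owner_team: Optional[str] = None                     # e.g. from a teams/HRIS system


def trust_score(table: TableMetadata, now: datetime) -> float:
    """Toy heuristic in [0, 1]: fresh, widely used, owned tables score higher.

    The four signals and their equal weights are assumptions for illustration.
    """
    fresh = 1.0 if now - table.last_updated <= timedelta(days=1) else 0.0
    popular = min(table.weekly_readers, 50) / 50
    referenced = 1.0 if table.dashboards else 0.0
    owned = 1.0 if table.owner_team else 0.0
    return round(0.25 * (fresh + popular + referenced + owned), 3)


# Made-up example tables: one actively maintained, one abandoned.
now = datetime(2021, 4, 26, 12, 0)
rides = TableMetadata(
    "core.rides", now - timedelta(hours=3), 120, ["Weekly rides"], "data-platform"
)
backup = TableMetadata("tmp.rides_backup", now - timedelta(days=90), 0)

for t in (rides, backup):
    print(t.name, trust_score(t, now))
```

A catalog UI or an API could then rank search results by such a score, so that the actively maintained table surfaces ahead of the stale copy.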

In addition to helping users establish trust in the data, Lyft was able to use the same solution for data warehouse migration. Data Engineers were able to understand who uses the data and where it comes from, while Data Scientists and other data users were able to see which data to use from which warehouse during the migration.

Users also valued being able to preview data; see column statistics, frequent users, a table's last-updated date, and the source code and DAG that generated it; and view and edit the descriptions and tags associated with data.

Lastly, Amundsen helped begin the data mesh implementation within the company. It enabled producers and consumers of data to interface directly with each other instead of relying on the data team to play telephone. Amundsen plugs into the existing workflows of Data Producers (Product Engineers for source data and Data Engineers for derived data) to pull in and publish rich metadata about the data being produced. Additionally, data consumers (Data Scientists and Ops users) can request additional metadata and report issues with the data from within Amundsen, which then files a Jira ticket and notifies the data producers.
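The issue-reporting flow described above might look roughly like the following sketch. It does not call Jira or Amundsen; `InMemoryTracker`, `report_issue`, the `DATA` project key, and the example addresses are hypothetical stand-ins used only to show the routing idea: the catalog already knows who owns a table, so an issue raised by a consumer can be addressed to the producers directly.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Ticket:
    key: str
    table: str
    summary: str
    reporter: str
    notified: List[str]  # owners who should be notified about this ticket


class InMemoryTracker:
    """Stand-in for an issue tracker; no real tracker API is called."""

    def __init__(self, project: str) -> None:
        self.project = project
        self.tickets: List[Ticket] = []

    def create(self, table: str, summary: str, reporter: str,
               owners: List[str]) -> Ticket:
        key = f"{self.project}-{len(self.tickets) + 1}"
        ticket = Ticket(key, table, summary, reporter, list(owners))
        self.tickets.append(ticket)
        return ticket


def report_issue(tracker: InMemoryTracker, owners_by_table: Dict[str, List[str]],
                 table: str, summary: str, reporter: str) -> Ticket:
    """Look up the table's owners in catalog metadata and file a ticket for them."""
    owners = owners_by_table.get(table, [])
    return tracker.create(table, summary, reporter, owners)


# Made-up ownership metadata and a sample report from a data consumer.
tracker = InMemoryTracker("DATA")
owners_by_table = {"core.rides": ["producer@example.com"]}
ticket = report_issue(
    tracker, owners_by_table, "core.rides",
    "Column fare_usd is null since yesterday", "analyst@example.com",
)
print(ticket.key, ticket.notified)
```

In a real deployment the tracker class would be replaced by a client for the organization's ticketing system, and the ownership mapping would come from the catalog itself.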

The Results

Amundsen is widely adopted at Lyft and has remained the top-scoring Data & ML product at Lyft since it was created.

It's used by 750 users, with 75% of Data Scientists, Analysts and Data Engineers using it every week.

"Amundsen helped analysts at Lyft spend less time figuring out what data to trust, and more time on what’s important - solving business problems. I’m excited for Stemma to be bringing the power of Amundsen to other data-driven organizations."
George Xing
Formerly Head of Analytics at Lyft

Amundsen has consistently earned high customer satisfaction (CSAT) scores from data users. To this day, it remains the highest CSAT-scoring Data & ML product at Lyft.

Technology
Ridesharing