Snowflake is arguably the most iconic vendor in the modern data stack, and one of the most common drivers of cost I’ve seen among data teams using Amundsen or Stemma. As part of the current breed of cloud-based warehouses, it allows organizations to store and analyze large amounts of data with ease. However, as with any storage and processing solution, costs can quickly add up through ongoing use. While Snowflake provides guidance for basic hygiene to manage costs, most teams can claw back far more by eliminating unnecessary compute driven by inefficient ETL jobs.
Focus on reducing compute over storage
The first thing to recognize is that most of your costs likely come from compute. Storage has become cheap, and the effort to change how data is stored is relatively high¹. Compute costs come from a variety of sources, including ETL jobs, scheduled queries from BI tools, and ad hoc queries from SQL query tools. In my experience, ETL queries are by far the most common and highest-return opportunity for cost reduction. As an added bonus, they sit entirely within Data Engineering’s domain, so any data team can address them without cross-team coordination.
Reduce ETL compute from unused tables
One source of wasted ETL compute comes from jobs that routinely update unused tables. This is the lowest hanging fruit I see across companies, over and over again.
Here’s an example query we put together using the recently released Snowflake access_history view. It lists the tables in your warehouse ordered by the number of queries run against them over the last 7 days:
Commonly queried tables over last 7 days
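A sketch of such a query, assuming access to the SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view (reading it requires the ACCOUNTADMIN role or imported privileges on the SNOWFLAKE database):

```sql
-- Tables ranked by how many queries touched them in the last 7 days.
-- base_objects_accessed is an array, so flatten it to one row per table.
SELECT
    obj.value:objectName::STRING AS table_name,
    COUNT(DISTINCT ah.query_id)  AS n_queries
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND obj.value:objectDomain::STRING = 'Table'
GROUP BY table_name
ORDER BY n_queries DESC;
```

Tables missing from this result set entirely had no reads at all during the window, which makes them the strongest deprecation candidates.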
Combine the results of the above example query with the cost of the ETL query that populates each table, and you have your prioritized list of tables to deprecate.
Once you have that list, you can do one of the following remedies to reduce cost:
- Deprecate the data: stop running the ETL job and subsequently delete the data.
- Reduce the ETL frequency from hourly/daily to weekly/monthly.
- Replace the materialized table with a plain (non-materialized) view.
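As an illustration of the last remedy, here is a sketch (the table and column names are hypothetical) of swapping a materialized table for a plain view and then dropping the table:

```sql
-- Hypothetical example: replace the materialized table with a plain view
-- so the ETL job that rebuilt it can be retired. Compute is now spent only
-- when someone actually queries the view.
CREATE OR REPLACE VIEW my_db.analytics.daily_revenue_v AS
SELECT order_date, SUM(amount) AS revenue
FROM my_db.raw.orders
GROUP BY order_date;

-- After repointing downstream queries at the view, disable Time Travel on
-- the old table so the dropped data does not keep accruing storage, then
-- drop it.
ALTER TABLE my_db.analytics.daily_revenue SET DATA_RETENTION_TIME_IN_DAYS = 0;
DROP TABLE my_db.analytics.daily_revenue;
```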
NOTE: Do not forget to turn off Time Travel when deleting data. More best practices can be found in this article.
The low-risk option is to simply reduce the frequency of ETL updates, but that leads to cruft. So in my opinion, the better approach is to cut the ETL job and deprecate the table.
If you are lucky, you will have some tables with zero queries against them. But most tables are likely to have at least some queries and therefore some risk. The common risk in deprecating data or reducing its ETL frequency is that the table might still be in use. In the worst case, deprecating such a table can accidentally skew business decisions by breaking a dashboard, cause ML model performance to drift due to poor training data, or waste analysts’ time if they use the data during exploratory analysis.
To reduce this risk, we will first get an accurate and comprehensive view of the dependencies on the data, and then effectively communicate changes to the right stakeholders. Here’s an example query in Snowflake that shows which queries have been run against a particular table over the last 7 days:
Queries against a table over last 7 days
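A sketch of that lookup, joining ACCESS_HISTORY to QUERY_HISTORY in ACCOUNT_USAGE (MY_DB.MY_SCHEMA.MY_TABLE is a placeholder; objectName is stored fully qualified):

```sql
-- Every query that read a given table in the last 7 days.
SELECT qh.query_id,
       qh.user_name,
       qh.query_text,
       qh.start_time
FROM snowflake.account_usage.query_history AS qh
WHERE qh.query_id IN (
    SELECT ah.query_id
    FROM snowflake.account_usage.access_history AS ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
    WHERE obj.value:objectName::STRING = 'MY_DB.MY_SCHEMA.MY_TABLE'
      AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
)
ORDER BY qh.start_time DESC;
```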
The next query lists the accounts that ran queries against a table. You can then contact those users to better assess risk and communicate the upcoming change.
Accounts executing queries against a table over last 7 days (descending order)
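A sketch of that aggregation, again with a placeholder table name:

```sql
-- Users ranked by how often they queried the table in the last 7 days.
SELECT ah.user_name,
       COUNT(DISTINCT ah.query_id) AS n_queries
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:objectName::STRING = 'MY_DB.MY_SCHEMA.MY_TABLE'
  AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY ah.user_name
ORDER BY n_queries DESC;
```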
Reduce ETL compute by optimizing data for common queries
Once unused data is eliminated, the next best improvement in compute cost comes from optimizing expensive queries. Here again, usage patterns can provide clues to identify assets for engineers to address.
Snowflake’s own quickstart guide offers a sample SQL query to identify the most expensive queries in the warehouse over the last 30 days. You can then use the query from the previous section to identify the accounts behind those expensive queries. We have found the most impactful targets to be unoptimized ETL queries and unoptimized scheduled queries coming from your BI tool.
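As a rough sketch of that idea, ranking recurring query shapes by total elapsed time as a simple proxy for cost (the quickstart refines this by weighting for warehouse size):

```sql
-- Recurring query shapes ranked by total elapsed time over 30 days.
-- total_elapsed_time is in milliseconds in QUERY_HISTORY.
SELECT qh.query_text,
       qh.user_name,
       qh.warehouse_name,
       COUNT(*)                          AS n_runs,
       SUM(qh.total_elapsed_time) / 1000 AS total_elapsed_s
FROM snowflake.account_usage.query_history AS qh
WHERE qh.start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND qh.warehouse_size IS NOT NULL   -- only queries that used a warehouse
GROUP BY qh.query_text, qh.user_name, qh.warehouse_name
ORDER BY total_elapsed_s DESC
LIMIT 25;
```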
Now that you have identified the expensive queries and learned which accounts are behind them, you are ready to optimize the queries using one of the following techniques:
1. Pre-compute joins and move the queries to the materialized, pre-joined table
This is valuable when a lot of users are trying to do the same join over and over again. In such cases, it’s best to materialize a table that precomputes the join and move the biggest offenders to the new table.
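A minimal sketch of this pattern, with hypothetical table names:

```sql
-- Pay for the join once per ETL run instead of once per ad hoc query.
CREATE OR REPLACE TABLE my_db.analytics.orders_enriched AS
SELECT o.order_id,
       o.order_date,
       o.amount,
       c.segment,
       c.region
FROM my_db.raw.orders    AS o
JOIN my_db.raw.customers AS c
  ON c.customer_id = o.customer_id;
```

The biggest offenders then query orders_enriched directly instead of repeating the join.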
2. Eliminate unnecessary joins, using partition/bucketing columns, etc.
This is valuable when queries simply aren’t following best practices - for example, computing the date with a date function on an unoptimized timestamp field instead of filtering on the date column that serves as the partitioning/bucketing column. As you optimize the largest offenders, share the best practices with the users writing those queries so the same patterns don’t recur.
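For instance, consider a hypothetical events table with a raw event_ts timestamp and an event_date column that serves as the partitioning/clustering key:

```sql
-- Anti-pattern: wrapping the timestamp in a function prevents the engine
-- from pruning partitions, so it scans far more data than necessary.
SELECT COUNT(*)
FROM events
WHERE TO_DATE(event_ts) = '2021-10-01';

-- Better: filter directly on the column the table is organized by.
SELECT COUNT(*)
FROM events
WHERE event_date = '2021-10-01';
```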
Simplify the process for regular maintenance by using a modern catalog
While data catalogs are traditionally seen as a tool to make data discoverable to analysts, they are becoming increasingly popular as a way for data engineers to manage and organize all the assets they own. In addition to acting as a repository for metadata, a modern catalog provides access to usage patterns, and makes it easy to communicate with downstream users and owners of other tables.
You can follow this link to learn more about Stemma, the modern data catalog for self-serve cultures, used by companies like Grafana, iRobot, SoFi, and Convoy. Stemma was built by the creators of the leading open-source data catalog, Amundsen, used by Lyft, Instacart, Square, ING, Snap and many others.
¹ One notable exception is high-dimension tables that churn frequently. These tables can increase storage costs surprisingly quickly through Snowflake’s Fail-safe and Time Travel features, which require extra storage, so pay extra attention to how Continuous Data Protection (CDP) is configured for them.