February 1, 2023 - 5 min read

How to Fix Your ETL to Lower Snowflake Costs

by Mark Grover, Co-founder and CEO of Stemma

Snowflake is arguably the most iconic vendor in the modern data stack, and therefore one of the most common drivers of cost I’ve seen among data teams using Amundsen or Stemma. As part of the current breed of cloud-based warehouses, it allows organizations to store and analyze large amounts of data with ease. However, as with any storage and processing solution, costs can quickly add up through ongoing use. While Snowflake provides guidance on basic hygiene to manage costs, most teams can claw back far more by cutting the unnecessary compute driven by inefficient ETL jobs.

Focus on reducing compute over storage

The first thing to recognize is that most of your costs are likely to come from compute. Storage has become cheap, and the effort to change the way data is stored is relatively high [1]. Compute costs can come from a variety of sources, including ETL jobs, scheduled queries from BI tools, and ad hoc queries from SQL query tools. In my experience, ETL queries are by far the most common and highest-return opportunity for cost reduction. As an added bonus, they are entirely in the domain of Data Engineering, which clears the way for any data team to address them.

Reduce ETL compute from unused tables

One source of wasted ETL compute comes from jobs that routinely update unused tables. This is the lowest-hanging fruit I see at company after company.

Here’s an example query we put together using the recently released Snowflake access_history view. It lists the tables in your warehouse ordered by the number of queries run against them over the last 7 days, least queried first:

Tables ranked by query count over the last 7 days (least queried first)

-- Tables with the fewest reads over the last 7 days float to the top;
-- these are your deprecation candidates.
SELECT    t.table_catalog,
          t.table_schema,
          t.table_name,
          COUNT(q.name) AS count_queries
FROM      snowflake.account_usage.tables t
LEFT JOIN
          (
                 -- Read-only queries (nothing modified) against tables in the last 7 days
                 SELECT f1.VALUE:"objectName"::string AS name
                 FROM   snowflake.account_usage.access_history ah
                 JOIN   snowflake.account_usage.query_history qh
                 ON     ah.query_id = qh.query_id ,
                        LATERAL FLATTEN(ah.base_objects_accessed) f1
                 WHERE  ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
                 AND    qh.end_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
                 AND    (ah.objects_modified IS NULL OR ARRAY_SIZE(ah.objects_modified) = 0)
                 AND    f1.VALUE:"objectDomain"::string = 'Table' ) q
ON        REPLACE(q.name, '"') = CONCAT(t.table_catalog, '.', t.table_schema, '.', t.table_name)
WHERE     t.deleted IS NULL
GROUP BY  t.table_catalog,
          t.table_schema,
          t.table_name
ORDER BY  count_queries ASC;

Combine the results of the above example query with the cost of the ETL job that populates each table, and you have a prioritized list of tables to deprecate.
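If you don’t already track per-job cost, here is a rough sketch of how you might approximate it, using the total elapsed time of queries that wrote to each table as a stand-in for compute cost (this is a simplification, not an exact credit calculation):

Approximate ETL load per table over last 7 days

-- Total execution time of queries that modified each table in the last 7 days.
SELECT   f1.VALUE:"objectName"::string     AS table_name,
         SUM(qh.total_elapsed_time) / 1000 AS etl_elapsed_seconds
FROM     snowflake.account_usage.access_history ah
JOIN     snowflake.account_usage.query_history qh
ON       ah.query_id = qh.query_id ,
         LATERAL FLATTEN(ah.objects_modified) f1
WHERE    ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND      f1.VALUE:"objectDomain"::string = 'Table'
GROUP BY table_name
ORDER BY etl_elapsed_seconds DESC;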

Once you have that list, you can apply one of the following remedies to reduce cost:

  1. Deprecate the data: stop running the ETL job, then delete the table.
  2. Reduce the ETL frequency from hourly/daily to weekly/monthly.
  3. Convert the materialized table into a regular (non-materialized) view.

NOTE: Do not forget to turn off Time Travel when deleting data. More best practices can be found in this article.
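As a rough sketch, deprecation might look something like this. The task and table names below are placeholders, and the SUSPEND step only applies if your ETL runs as a Snowflake task:

Example deprecation steps for an unused table

-- 1. Stop the ETL that refreshes the table (if it runs as a Snowflake task).
ALTER TASK <DATABASE>.<SCHEMA>.<REFRESH TASK> SUSPEND;

-- 2. Turn off Time Travel so the dropped data does not keep accruing storage.
ALTER TABLE <DATABASE>.<SCHEMA>.<UNUSED TABLE> SET DATA_RETENTION_TIME_IN_DAYS = 0;

-- 3. Drop the table once stakeholders have signed off.
DROP TABLE <DATABASE>.<SCHEMA>.<UNUSED TABLE>;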

The low-risk option is to simply reduce the frequency of ETL updates, but that leaves cruft behind. In my opinion, the better approach is to cut the ETL job and deprecate the table.

If you are lucky, you will have some tables with zero queries against them. But most tables are likely to have at least some queries, and therefore some risk. The common risk of deprecating data or reducing its ETL frequency is that the table might still be in use. In the worst case, deprecating such a table can skew business decisions by breaking a dashboard, cause ML model performance to drift due to stale training data, or waste analysts’ time during exploratory analysis.

To reduce this risk, first get an accurate and comprehensive view of the dependencies on the data, then communicate the change to the right stakeholders. Here’s an example query in Snowflake that shows which queries have been run against a particular table over the last 7 days:

Queries against a table over last 7 days

-- Replace <CATALOG NAME>, <SCHEMA NAME>, and <TABLE NAME> with the table you plan to deprecate.
SELECT qh.end_time,
       qh.query_type,
       qh.user_name,
       qh.query_text
FROM   snowflake.account_usage.access_history ah
JOIN   snowflake.account_usage.query_history qh
ON     ah.query_id = qh.query_id ,
       LATERAL FLATTEN(ah.base_objects_accessed) f1
WHERE  ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND    qh.end_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND    f1.VALUE:"objectDomain"::string = 'Table'
AND    (ah.objects_modified IS NULL OR ARRAY_SIZE(ah.objects_modified) = 0)
AND    REPLACE(f1.VALUE:"objectName"::string, '"') = CONCAT('<CATALOG NAME>', '.', '<SCHEMA NAME>', '.', '<TABLE NAME>');

The next query lists the accounts that ran queries against the table. You can then contact those users to better assess the risk and communicate the upcoming change.

Accounts executing queries against a table over last 7 days (descending order)

-- Replace <CATALOG NAME>, <SCHEMA NAME>, and <TABLE NAME> with the table you plan to deprecate.
SELECT   qh.user_name,
         COUNT(*) AS cnt
FROM     snowflake.account_usage.access_history ah
JOIN     snowflake.account_usage.query_history qh
ON       ah.query_id = qh.query_id ,
         LATERAL FLATTEN(ah.base_objects_accessed) f1
WHERE    ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND      qh.end_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND      f1.VALUE:"objectDomain"::string = 'Table'
AND      (ah.objects_modified IS NULL OR ARRAY_SIZE(ah.objects_modified) = 0)
AND      REPLACE(f1.VALUE:"objectName"::string, '"') = CONCAT('<CATALOG NAME>', '.', '<SCHEMA NAME>', '.', '<TABLE NAME>')
GROUP BY qh.user_name
ORDER BY cnt DESC;

Reduce ETL compute by optimizing data for common queries 

Once unused data is eliminated, the next best improvement in compute cost comes from optimizing expensive queries. Here again, usage patterns can provide clues to identify assets for engineers to address. 

Snowflake’s own quickstart guide offers a sample SQL query to identify the most expensive queries in the warehouse over the last 30 days. You can then reuse the account-level query from the previous section to identify who is behind those expensive queries. In our experience, the biggest offenders are unoptimized ETL queries and unoptimized scheduled queries coming from your BI tool.
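If you don’t want to pull up the quickstart, here is a rough approximation (not Snowflake’s exact sample query) that ranks queries by total elapsed time over the last 30 days as a proxy for compute cost:

Longest-running queries over last 30 days

-- Rank queries by total elapsed time as a rough stand-in for warehouse cost.
SELECT   qh.query_text,
         qh.user_name,
         qh.warehouse_name,
         COUNT(*)                          AS execution_count,
         SUM(qh.total_elapsed_time) / 1000 AS total_elapsed_seconds
FROM     snowflake.account_usage.query_history qh
WHERE    qh.start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND      qh.warehouse_name IS NOT NULL
GROUP BY qh.query_text, qh.user_name, qh.warehouse_name
ORDER BY total_elapsed_seconds DESC
LIMIT    25;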

Now that you have identified the expensive queries and learned which accounts are behind them, you are ready to optimize the queries using one of the following techniques:

1. Pre-compute joins and move the queries to the materialized, pre-joined table

This is valuable when a lot of users are running the same join over and over. In that case, it’s best to materialize a table that precomputes the join and move the biggest offenders to the new table.
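For example, with hypothetical table and column names, the repeated join might be materialized like this, and the heaviest queries then pointed at the new table:

Example: materializing a repeated join

-- Precompute the join once; refresh on the same schedule as the upstream tables.
CREATE OR REPLACE TABLE analytics.core.orders_enriched AS
SELECT o.*,
       c.customer_name,
       c.customer_segment
FROM   analytics.raw.orders o
JOIN   analytics.raw.customers c
ON     o.customer_id = c.customer_id;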

2. Eliminate unnecessary joins, using partition/bucketing columns, etc.

This is valuable when queries simply aren’t following best practices - for example, filtering by applying a date function to an unoptimized timestamp field instead of filtering on the date column the table is partitioned/bucketed by. As you optimize the largest offenders, share the best practices with the users writing those queries so they avoid the same patterns in the future.
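As a small illustration (the table and columns here are hypothetical), the fix is often as simple as filtering on the column the table is partitioned or clustered on instead of wrapping a raw timestamp in a function:

Example: filtering on the partition/clustering column

-- Before: the function must be evaluated for every row, which defeats pruning.
SELECT COUNT(*)
FROM   analytics.core.events
WHERE  TO_DATE(event_timestamp) = '2023-01-15';

-- After: filter directly on the date column the table is clustered on.
SELECT COUNT(*)
FROM   analytics.core.events
WHERE  event_date = '2023-01-15';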

Simplify the process for regular maintenance by using a modern catalog

See downstream dependencies of a particular table or column (image: Stemma data catalog)

While data catalogs are traditionally seen as a tool to make data discoverable to analysts, they are becoming increasingly popular as a way for data engineers to manage and organize all the assets they own. In addition to acting as a repository for metadata, a modern catalog provides access to usage patterns, and makes it easy to communicate with downstream users and owners of other tables.

Message table owners, users, and downstream stakeholders (image: Stemma data catalog)

You can follow this link to learn more about Stemma, the modern data catalog for self-serve cultures, used by companies like Grafana, iRobot, SoFi, and Convoy. Stemma was built by the creators of the leading open-source data catalog, Amundsen, used by Lyft, Instacart, Square, ING, Snap and many others.

[1]: One notable exception is large dimension tables that churn frequently. Such tables can surprisingly increase storage costs through Snowflake’s Fail-safe and Time Travel features, which require extra storage, so pay extra attention to how Continuous Data Protection (CDP) is configured for them.
