When working with dbt, the experience can be seamless at its best but baffling at its worst. This is especially apparent to data engineers who need to dig into SQL queries to debug their tables' lineage while also being responsible for delivering those tables on time. Thankfully, a little understanding goes a long way toward making this process faster and less painful. The key is to be aware of how dbt materializations work and how they use temporary tables to keep data up to date.
The team at Stemma looked into this, since we parse the SQL queries being run on the data warehouse to automatically capture lineage between data assets. While the manifest.json file describes lineage accurately between models, it is not always available to the catalog, so parsing SQL is a useful alternative.
Users of dbt can set different materializations for their models. Materializations are strategies for persisting dbt models, and they influence what queries dbt runs to actually compute the models. More information about materializations can be found in the dbt docs.
The view materialization is the default strategy, and it's also the simplest and most lightweight option. It simply creates views out of our models using a CREATE VIEW AS statement. This doesn't persist any data. It can also be slow for any serious computation, since the work has to be redone each time the model is queried.
The table materialization is also a fairly simple option. It creates a table from our model using CREATE TABLE AS in SQL. This actually materializes the data, meaning that it will be fast to query, but it also takes up space.
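Conceptually, these two strategies each boil down to a single DDL statement per model. The schema, table, and column names below are hypothetical, and as the next section explains, dbt does not actually run these statements against the final model name directly:

```sql
-- view materialization: no data is persisted
create view analytics.active_users as (
    select user_id, last_seen_at
    from raw.events
    where last_seen_at > current_date - 30
);

-- table materialization: the query result is persisted
create table analytics.orders as (
    select order_id, customer_id, amount
    from raw.orders
);
```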
Both of these materializations create an intermediary relation in the background. This means that dbt creates a table/view (relation from now on) with the same name as our model, but with the suffix __dbt_tmp. dbt first drops any preexisting intermediary relation our model might have. Then it computes fresh data into the intermediary relation (or just creates a view). Afterwards, it backs up our old model and renames the intermediary relation to the final name, all in a single database transaction. This approach avoids losing data in the event that the new model's code errors out. Such a loss could happen if dbt dropped our old relation first and then, for some reason, was unable to create the new one. It's also an atomic operation, so there is no point at which we are left without data. Finally, if everything has gone to plan, the backup relation is deleted, and all that remains is the up-to-date model.
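On an adapter with transactional DDL (say, Postgres or Redshift), the sequence for a table model might look roughly like this. The relation names are hypothetical and the exact statements dbt generates vary by adapter, but the shape of the dance is the same:

```sql
begin;
-- 1. drop any leftover intermediary relation from a failed run
drop table if exists analytics.orders__dbt_tmp cascade;
-- 2. build fresh data under the temporary name
create table analytics.orders__dbt_tmp as (
    select order_id, customer_id, amount
    from raw.orders
);
-- 3. back up the old model, then swap in the new one
alter table analytics.orders rename to orders__dbt_backup;
alter table analytics.orders__dbt_tmp rename to orders;
commit;
-- 4. everything worked, so the backup can go
drop table if exists analytics.orders__dbt_backup cascade;
```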
The incremental materialization is useful if the model is expensive to compute. Using this strategy, dbt only appends or updates rows in the table that have changed upstream since the last time dbt ran. However, it’s a bit more complex to configure than the other materializations, and has more pitfalls.
Incremental tables are more complicated under the hood than table or view materializations. Their behavior depends on whether we are doing a full refresh, whether we specified a unique key, and so on. Therefore, we’ll only go into enough detail to understand temporary tables.
dbt first materializes the new data into a temporary table, using the model's filtering criteria so it doesn't have to recompute the whole table. If a unique key is specified, dbt updates matching rows in the target and inserts the rest from the temporary table. If not, it works in append-only mode and simply inserts everything.
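Sketched in SQL, with hypothetical names, this looks something like the following. Note that depending on the adapter and configuration, dbt may use a delete-and-insert pattern (shown here) or a MERGE statement instead:

```sql
-- 1. materialize only the new/changed rows into a temporary table
create temporary table orders__dbt_tmp as (
    select order_id, customer_id, amount, updated_at
    from raw.orders
    -- the model's incremental filter, compiled to plain SQL
    where updated_at > (select max(updated_at) from analytics.orders)
);

-- 2a. with a unique key (order_id): replace matching rows, then insert
delete from analytics.orders
using orders__dbt_tmp
where analytics.orders.order_id = orders__dbt_tmp.order_id;

insert into analytics.orders
select * from orders__dbt_tmp;

-- 2b. without a unique key, only the insert runs (append-only)
```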
All of the materializations we’ve covered so far create a temporary or intermediate relation. If we want to derive some lineage between our models, or between models and assets not managed by dbt, we won’t see the actual models being created in our SQL query history. Instead, the SQL statements we should be looking for are creating relations with the __dbt_tmp suffix.
Ephemeral materializations are useful too, but they don't generate SQL statements of their own: their code is interpolated (as common table expressions) into the models that use them. Therefore they're not relevant to our investigation.
Uncovering dbt’s tricks
In light of the above, the SQL queries performed by dbt suddenly make sense:
- Table/view materialization: We are not going to find CREATE TABLE AS statements with our model's name; the statement always targets the temporary relation, which is then renamed
- Incremental materialization: We will see a CREATE TABLE AS for the temporary table, followed by INSERT (or DELETE and INSERT) operations moving rows from the temporary table into the target incremental table
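For example, a single statement like the one below (names hypothetical) is enough to recover the lineage of the final model. Stripping the suffix gives the model's name, and the upstream tables feeding the temporary relation are exactly the upstream tables of the model itself:

```sql
create table analytics.orders__dbt_tmp as (
    select o.order_id, c.customer_name, o.amount
    from raw.orders o
    join raw.customers c on o.customer_id = c.customer_id
);
-- implied lineage: raw.orders    -> analytics.orders
--                  raw.customers -> analytics.orders
```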
When we look at all the built-in materializations, it turns out that whenever we see a CREATE TABLE AS or CREATE VIEW AS statement for a relation ending in __dbt_tmp, whatever upstream dependencies that statement implies also hold for our final model. Of course, this doesn't apply to any custom materializations we write ourselves. While dbt introduces some complexity to the SQL trail of lineage, it is well worth the benefits that dbt provides. This complexity is also handled automatically by dbt's own presentation of lineage, as well as by some dedicated data observability tools like the Stemma data catalog.