
DBT Core - Generating dbt models from Snowflake

What is dbt

DBT (Data Build Tool) is an open-source tool that helps data analysts and data engineers develop and maintain data pipelines. It is designed to work with a variety of databases and aims to make it easier to build, test, and deploy data transformations.

It is built on top of SQL and provides a set of declarative commands that can be used to define and manipulate data. It includes a number of features that make it easier to work with data, including support for version control, testing, and documentation. DBT also includes a CLI (command-line interface) that can be used to run and manage data pipelines, as well as a generated documentation site for exploring your models and their lineage.

DBT is commonly used in the context of data warehousing and business intelligence, where it can be used to build and maintain the data pipelines that feed data into analytical tools like PowerBI, Tableau and Looker. It is also used in a variety of other contexts, such as data lakes, data marts, and ETL (extract, transform, load) processes.

Why use dbt?

The main reasons to pick dbt over other tools are:

  1. It's SQL! Anyone remotely interested in data knows SQL; it's a widely used and powerful language for working with data.

  2. DBT supports testing and version control, which helps to ensure the quality and reliability of data pipelines.

  3. DBT supports a number of packages and adapters that extend its functionality, such as support for additional databases and code generation.

  4. DBT is used by a wide range of organizations, which means that there is a large community of users and a wealth of resources available to help you get started and troubleshoot issues.

dbt core vs dbt cloud

DBT (Data Build Tool) Core is the open-source version of DBT that you can download and install on your own computer or server. It includes a command-line interface (CLI) that you can use to run and manage data pipelines, and it can generate a documentation site for exploring your models.

DBT Cloud is a commercial product offered by dbt Labs (formerly Fishtown Analytics) that provides a cloud-based platform for running DBT pipelines. It includes all of the features of DBT Core, but with the added convenience of being able to run pipelines in the cloud. This can be useful if you don't want to manage your own infrastructure, or if you need to scale your pipelines to run on large amounts of data.

One key difference between DBT Core and DBT Cloud is that DBT Cloud includes additional features that are not available in the open-source version. These include:

  • Automatic scheduling: DBT Cloud can automatically schedule your pipelines to run on a regular basis, without the need to manually trigger them.

  • Collaboration tools: DBT Cloud includes tools for collaborating with your team, such as the ability to share pipelines and track changes.

  • Monitoring and alerts: DBT Cloud includes monitoring and alerting features that can help you keep track of the health and performance of your pipelines.

DBT Core can essentially be used in a similar manner as DBT Cloud, but you would be responsible for the infrastructure (for example, using Apache Airflow or cron for scheduling), as sketched below.
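To give a rough idea of the self-managed route, the simplest alternative to DBT Cloud's scheduler is a cron entry that runs your project on a schedule. This is only a sketch; the project path is a placeholder, and it assumes your dbt profile is already configured and that dbt is on the cron user's PATH (or referenced by its full path):

bash
# Run the project every day at 06:00; /path/to/dbtprojects/your_project_name is a placeholder
0 6 * * * cd /path/to/dbtprojects/your_project_name && dbt run >> /tmp/dbt_run.log 2>&1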

Installing dbt core

First of all, it's a good idea to install either mamba or miniconda so that it's easy to create new Python environments. I installed mamba in this case; it's a drop-in replacement for conda that uses the same commands and configuration options, but it's faster and more lightweight. With mamba installed, create a new env, activate it, and install the dbt dependencies (together with some other useful packages).

bash
mamba create -y -n de -c conda-forge python=3.10 pip-tools pandas numpy ipykernel sqlalchemy psycopg2 pyodbc plotly plotly_express xlrd openpyxl
mamba activate de
pip install dbt-core dbt-postgres dbt-snowflake
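
You can verify the installation and see which adapters were picked up with:

bash
dbt --version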

I'm installing both the postgres and snowflake adapters, as I use both. Now we need to create a dbt profile. Make a new folder where we will store our dbt projects:

bash
mkdir dbtprojects
cd dbtprojects

Creating a dbt project

Then we can initialize a new project:

bash
dbt init

Follow the prompts: name your project, choose the adapter you want (in this case snowflake), and enter your Snowflake details. You will notice that a new file has been created at ~/.dbt/profiles.yml.
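
For reference, a Snowflake profile in ~/.dbt/profiles.yml looks roughly like the sketch below; every value is a placeholder and yours will contain whatever you entered in the prompts:

yml
your_project_name:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier
      user: your_user
      password: your_password
      role: your_role
      database: your_snowflake_db
      warehouse: your_warehouse
      schema: your_snowflake_schema
      threads: 4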

Now we can cd into the project directory (I'm calling it your_project_name) and create a new file called packages.yml

bash
cd your_project_name
touch packages.yml

Installing dbt packages

In this file we define the dbt packages we would like to install. A useful package is dbt-labs/codegen.

yml
packages.yml
packages:
  - package: dbt-labs/codegen
    version: 0.8.1

This package provides code generation macros. To install the dependency, run dbt from the CLI:

bash
dbt deps

Generating dbt models

Now we are ready to generate code for what we already have in Snowflake. First we have to create our sources.yml file under the models/your_project_name directory (you can rename the example directory to your_project_name and delete the example schema.yml and SQL models inside it). For our example we will use the SALESEXAMPLE table, which is already in our database. Remember to change the source details (name, database, and schema) to match your Snowflake setup.

yml
version: 2

sources:
  - name: thesourcename
    database: your_snowflake_db
    schema: your_snowflake_schema
    quoting:
      database: true
      schema: false
      identifier: false
    tables:
      - name: SALESEXAMPLE
        identifier: SALESEXAMPLE
        description: The example table we will use

Note that there is a pretty good best practices guide available in dbt's documentation. I will only be showing how to quickly get started; following the best practices (i.e. creating your folder structure) is up to you 😄.

With that done, we can generate our base model for the table via the command line

Using the CLI to generate models

bash
dbt run-operation generate_base_model --args '{"source_name": "thesourcename", "table_name": "SALESEXAMPLE", "case_sensitive_cols": false}'

The output looks roughly like the following. The real output lists every column of the table; the sketch below shows only the columns used later in this post, with the source name from our sources.yml:
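
sql
with source as (

    select * from {{ source('thesourcename', 'SALESEXAMPLE') }}

),

renamed as (

    select
        -- codegen lists every column of the table here; trimmed for brevity
        quantityordered,
        sales,
        orderdate

    from source

)

select * from renamed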

Using VS-Code extensions to generate models

Another nice way to achieve the same result is to install the VS-Code extension dbt Power User. Once it's installed and you've gone through the extension's documentation, select the Python interpreter with dbt installed (in VS-Code: CTRL+SHIFT+P -> Python: Select Interpreter -> Python 3.10.6('de') for our virtual environment). You can then generate the models directly from VS-Code by clicking on Generate model in the sources.yml file.

This will be our base model for the source table SALESEXAMPLE. The code generation is VERY useful for tables with loads of columns (I'm pretty lazy when it comes to writing out column names). You can save the result in a new file; I just called it salesexample_base.sql.

In my base model I don't need all the columns, so I end up with only the following:

sql
salesexample_base.sql
with source as (
    select * from {{ source('thesourcename', 'SALESEXAMPLE') }}
),

renamed as (
    select
        quantityordered,
        sales,
        orderdate
    from source
)

select * from renamed

To demonstrate, I will make a simple grouped view with this model.

Create a new file models/example/salesexample_grouped.sql

sql
salesexample_grouped.sql
with
salesexample as (
    select * from {{ ref('salesexample_base') }}
),

final as (
    select
        year(orderdate) as year,
        month(orderdate) as month,
        sum(quantityordered) as totalordered,
        sum(sales) as totalsales
    from salesexample
    group by year(orderdate), month(orderdate)
    order by year desc, month desc
)

select *
from final

If we want all our models in the example folder to be materialized as views by default, we need to make a small change in dbt_project.yml, setting +materialized to view.

yml
models:
  your_project_name:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

Note that, alternatively, materializations can be configured directly inside the model SQL files with the config jinja function as the very first line of the file, for example:
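
sql
-- first line of e.g. salesexample_grouped.sql
{{ config(materialized='view') }}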

Running our dbt project

The last thing we need to do is look at our schema.yml file, which contains the definitions for our models. You can also add the column names if you want; they are not resources in and of themselves, but it's possible to define sub-properties such as tests on columns, for example to check that a column contains no duplicates or null values (see the sketch after the file below).

yml
version: 2

models:
  - name: salesexample_base
    description: 'SALESEXAMPLE table'
    columns:
      - name: quantityordered
      - name: sales
      - name: orderdate
  - name: salesexample_grouped
    description: 'SALESEXAMPLE table grouped by orderdate'
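
As a quick sketch of what such tests could look like (which columns to test is illustrative here, not part of the original project), you add a tests key under a column and then run the tests with dbt test:

yml
models:
  - name: salesexample_base
    columns:
      - name: orderdate
        tests:
          - not_null
      - name: sales
        tests:
          - not_null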

We can compile our dbt project with

bash
dbt compile

If there are no errors, we can now execute the SQL model files against our target database:

bash
dbt run

Now if you sign in to Snowflake, you will see that the two new views are there:

sql
show terse views;
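
You can also query the grouped view directly to sanity-check the output (assuming your session is using the same database and schema the models were built in):

sql
select * from salesexample_grouped limit 10;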
