Apache Airflow and DBT on Docker Compose

Executing DBT models through Apache Airflow can be both challenging and fun at the same time! The previous post, “Schedule dbt models with Apache Airflow”, was a first iteration on the topic. This time, the focus is on running Apache Airflow and DBT through Docker Compose. Using self-contained Docker images and a public ecommerce dataset from Instacart makes it more interesting and interactive. Let’s get into it!

One of the challenges of integrating the two tools is that they are heavily customisable. There is no single best approach for running and scheduling models. The most common example online is to use the Bash operator to execute ‘dbt run’ and mass-execute all models. As mentioned in my previous post, one of the cool features of Airflow is monitoring individual task execution in a single UI. So how do we do it?

First things first:

Download the zipped CSVs from Kaggle (the repository contains only five out of the six files due to size limitations) and extract them into ./sample_data within the repository.

Next, activate the services by running $ docker-compose up. It will take a few minutes for all images and packages to download.

Docker Compose

Docker Compose plays a key role in setting up Apache Airflow and installing DBT. It also organises all the required services, centrally hosts environment variables and connection settings, and generally orchestrates the setup. The key services are:

  • Airflow server
  • PostgreSQL DB for supporting Airflow
  • PostgreSQL DB for storing the DBT models
  • Adminer, a lightweight DB client to interact with either DB
[Image: Docker Compose services up and running]

The backbone of Docker Compose is the docker-compose.yml file, which contains the architectural design of the services. This approach is much cleaner than running all services on a single machine.

Structuring DBT models

Once the services are activated, a lot of things are going on. But the main area of focus is how the different models can be associated with different DAGs and DAG tasks. We want to have three types of models:

  • Initialisation tables, which hold some data cleaning or tidying-up transformations
  • Snapshot tables which hold an as-is view across all data
  • Daily incremental tables

Within DBT, the above requirement is satisfied with three sets of models organised into folders: core, daily and initialisation.

[Image: DBT model structure]
  • Initialisation: meant to execute only ONCE AFTER the seed data is loaded. You will need to re-run this if the database is dropped.
  • Core: models that represent an as-is snapshot of the data within PostgreSQL.
  • Daily: incremental models that are scheduled and run on a daily basis.

How do we organise those models within the Airflow execution model? We take advantage of DBT tags within schema.yml. For example, every model under the core folder is tagged as “snapshot”. Similar tags are assigned to the daily and initialisation folders.

[Image: schema.yml – using tags]

Obviously, you can get very creative with your requirements and complicate it as much as you need; or not!

Apache Airflow DAG definition

As Airflow is initialised, dbt compile is executed. As a result, the manifest.json file is updated; it holds all the information about the node structures, dependencies, raw SQL and assigned tags. Parsing this file provides all the vital information we need for building the Airflow tasks.

[Image: dag.py – parsing manifest.json]
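To make this concrete, here is a minimal sketch of what parsing manifest.json can look like. The manifest path and the helper name load_dbt_manifest are assumptions for illustration, and some node keys vary by dbt version (e.g. raw_sql vs raw_code):

```python
import json


def load_dbt_manifest(manifest_path="/dbt/target/manifest.json"):
    """Return a dict of dbt models with their tags and upstream model dependencies."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    models = {}
    for node in manifest["nodes"].values():
        # The manifest also lists seeds, tests, etc.; keep only models.
        if node["resource_type"] != "model":
            continue
        models[node["name"]] = {
            "tags": node.get("tags", []),
            # Upstream dependencies look like "model.<project>.<model_name>".
            "depends_on": [
                dep for dep in node["depends_on"]["nodes"] if dep.startswith("model.")
            ],
        }
    return models
```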

Once this information is available, we can assign different operators to different DAGs based on their tags. For example, DBT models tagged as snapshot will be associated only with the snapshot_dag.

[Image: dag.py – DAG definition]
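Below is a hedged sketch of how tasks could be routed to a DAG by tag, roughly mirroring the idea in dag.py. The DAG id, paths and dbt CLI flags are assumptions (newer dbt uses --select instead of --models), and the BashOperator import path differs between Airflow 1.x and 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash on Airflow 2.x

snapshot_dag = DAG(
    "3_snapshot_dbt_models",          # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,           # triggered manually in this demo
)

models = load_dbt_manifest()          # helper from the previous sketch

# One Airflow task per model tagged "snapshot".
tasks = {}
for name, meta in models.items():
    if "snapshot" not in meta["tags"]:
        continue
    tasks[name] = BashOperator(
        task_id=f"dbt_run_{name}",
        bash_command=f"dbt run --models {name} --profiles-dir /dbt --project-dir /dbt",
        dag=snapshot_dag,
    )

# Mirror the dbt lineage as Airflow task dependencies.
for name, meta in models.items():
    if name not in tasks:
        continue
    for dep in meta["depends_on"]:
        dep_name = dep.split(".")[-1]  # "model.project.orders" -> "orders"
        if dep_name in tasks:
            tasks[dep_name] >> tasks[name]
```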

Generally, this is just a convention; to make it work, you will need alignment between your DBT model definitions and the DAG file(s) logic. Otherwise it is very easy to break the process. For this reason, it is important to be as descriptive as possible and maintain good documentation.

Running the models

As soon as the services are up (remember: $ docker-compose up), you can navigate to http://localhost:8000/admin/ and you will see the DAGs below (Off by default):

[Image: Airflow UI – DAGs overview]

They are numbered in order of execution. You will need to go step by step: activate them AND then trigger them. It is suggested to let one DAG finish before activating the next one, since the DAGs (in this demo) have no awareness of each other and you might end up trying to use data that has not been loaded yet! To activate a DAG, first switch it On and then click “Trigger DAG”.

[Image: Activating and manually triggering a DAG]

If everything goes well, the tasks will start getting executed. The DAG “1_load_initial_data” will perform the necessary initialisations and data loading operations. Those are needed for DBT later on.

[Image: DBT seed data loading tasks]

Furthermore, you can view the Gantt chart of the above operations and identify long-running tasks.

[Image: DAG “1_load_initial_data” – Gantt chart]

When it comes to running “4_daily_dbt_models”, all you have to do is switch it On. The DAG is on a @daily schedule and has a start_date of datetime(2019, 1, 1). As soon as it gets activated, it will automatically backfill the historic runs so the data is caught up. Furthermore, you can monitor the time required to execute each daily task.

[Image: 4_daily_dbt_models – daily_orders – task duration]
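For reference, here is a minimal sketch of how such a daily schedule with backfill could be declared. Beyond the DAG id, schedule and start_date mentioned above, everything else (the catchup flag, the dbt command and the run_date variable) is an assumption:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

daily_dag = DAG(
    "4_daily_dbt_models",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    # catchup defaults to True, so Airflow creates one run per day between
    # start_date and "now" as soon as the DAG is switched On (the backfill
    # behaviour described above).
    catchup=True,
)

daily_orders = BashOperator(
    task_id="dbt_run_daily_orders",
    # Passing the logical date to the incremental model as a dbt variable is
    # one common pattern; the variable name here is an assumption.
    bash_command=(
        "dbt run --models daily_orders "
        "--vars '{run_date: \"{{ ds }}\"}' --profiles-dir /dbt --project-dir /dbt"
    ),
    dag=daily_dag,
)
```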

Finally, you can always connect to the DB (Adminer UI) and view the models’ output. Adminer is a simple and quick client.

[Image: Adminer – top_selling_products model]

Once you are done, you can deactivate the services ($ docker-compose down) and remove the associated containers & volumes ($ docker-compose rm).

FYI: in case you work on macOS, be aware of an existing Docker bug that does not fully remove containers and volumes. Quick fix: $ docker run --privileged --pid=host docker/desktop-reclaim-space

Summary

Hopefully, this was a good practical example. Combining Apache Airflow and DBT is not trivial, but Docker Compose makes the process more structured and streamlined. An alternative approach would be to parse the raw SQL within manifest.json and execute it directly; that eliminates the need to use the Bash operator to run ‘dbt run ..’ commands (probably for the next iteration of the repository).
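As a rough illustration of where that could go, here is a sketch that attaches each model's compiled SQL to a PostgresOperator instead of shelling out to dbt. The connection id, paths and manifest keys are assumptions, and the operator import path is provider-package based on Airflow 2.x:

```python
import json

from airflow.operators.postgres_operator import PostgresOperator  # airflow.providers.postgres on Airflow 2.x

with open("/dbt/target/manifest.json") as f:
    nodes = json.load(f)["nodes"]

for node in nodes.values():
    if node["resource_type"] != "model":
        continue
    # After `dbt compile`, the Jinja-rendered SELECT lives under "compiled_sql"
    # ("compiled_code" in newer dbt versions); "raw_sql" is the un-rendered source.
    sql = node.get("compiled_sql") or node.get("compiled_code")
    if not sql:
        continue
    PostgresOperator(
        task_id=f"run_sql_{node['name']}",
        postgres_conn_id="dbt_postgres",   # assumed Airflow connection id
        sql=sql,
        dag=snapshot_dag,                  # attach to whichever DAG owns the model's tag
    )
```

Note that the compiled SQL is just a SELECT; dbt's materialisation logic (CREATE TABLE AS, incremental merges) would still need to be reproduced around it, so this is a direction rather than a drop-in replacement.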

Additional reading: