Executing DBT models through Apache Airflow can be both challenging and fun at the same time! The previous post, “Schedule dbt models with Apache Airflow”, was a first iteration on the topic. This time, the focus is on running Apache Airflow and DBT through Docker Compose. Using self-contained Docker images and a public ecommerce dataset from Instacart makes it more interesting and interactive. Let’s get into it!
One of the challenges of integrating the two tools is that both are heavily customisable. There is no single best approach to running and scheduling models. The most common example online uses the bash operator to execute ‘dbt run’ and mass-execute all models. As mentioned in my previous post, one of the cool features of Airflow is monitoring individual task execution in a single UI. So how do we do it?
First things first:
- Download Docker and Docker Compose
- Clone the repository from GitHub
- Download the Instacart dataset from Kaggle
Download the zipped CSVs from Kaggle (the repository contains only five of the six files due to size limitations) and extract them into ./sample_data within the repository.
Next, activate the services by running:
$ docker-compose up
It will take a few minutes for all images and packages to download.
Docker Compose plays a key role in setting up Apache Airflow and installing DBT. Furthermore, it organises all the required services, centrally hosts environment variables and connection settings, and generally orchestrates the setup. Key services are:
- Airflow server
- PostgreSQL DB for supporting Airflow
- PostgreSQL DB for storing the DBT models
- Adminer, a lightweight DB client to interact with either DB
The backbone of Docker Compose is the docker-compose.yml file, which contains the architectural design of the services. This approach is much cleaner than running all the services directly on a single machine.
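A trimmed sketch of what such a docker-compose.yml can look like is shown below. Service names, images and settings here are illustrative, not the repository’s exact values:

```yaml
version: "3"
services:
  airflow:                  # Airflow webserver + scheduler
    build: .
    ports:
      - "8000:8080"         # UI exposed on localhost:8000
    depends_on:
      - airflow_db
      - dbt_db
  airflow_db:               # PostgreSQL backing Airflow's metadata
    image: postgres
  dbt_db:                   # PostgreSQL target for the DBT models
    image: postgres
  adminer:                  # lightweight DB client
    image: adminer
```

Each service gets its own container, and Docker Compose wires up networking and startup order between them.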
Structuring DBT models
Once services are activated, a lot of things are going on. But the main area of focus is how the different models can be associated with different DAGs and DAG tasks. We want to have 3 types of models:
- Initialisation tables which hold some data cleaning or tidying-up transformations
- Snapshot tables which hold an as-is view across all data
- Daily incremental tables
Within DBT, the above requirement is satisfied as follows: 3 sets of models organised in folders named core, daily and initialisation.
- Initialisation: meant to execute only ONCE AFTER the seed data is loaded. You will need to re-run this if the database is dropped.
- Core: contains models that represent an as-is snapshot of the data within PostgreSQL.
- Daily: incremental models that are scheduled and run on a daily basis.
How do we organise those models within the Airflow execution model? We take advantage of DBT tags within schema.yml. For example, every model under the core folder is tagged as “snapshot”. Similar tags are assigned to the daily and initialisation folders.
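As an illustration, the tagging can look like the snippet below (the model name is made up, and the exact property syntax varies a bit across dbt versions):

```yaml
version: 2
models:
  - name: orders            # a model under the core folder
    tags:
      - snapshot            # picked up later when building DAG tasks
```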
Obviously, you can get very creative with your requirements and complicate things as much as you need; or not!
Apache Airflow DAG definition
As Airflow is initialised, dbt compile is executed. As a result, the manifest.json file is updated; it holds all the information about the node structures, dependencies, raw SQL and assigned tags. Parsing this file provides all the vital information we need to build the Airflow tasks.
Once this information is available, we can assign different operators to different DAGs based on their tags. For example, DBT models tagged as snapshot will be associated only with snapshot_dag.
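The parsing step can be sketched roughly as follows; a minimal helper that reads manifest.json and buckets model names by tag (the function name is my own, not the repository’s):

```python
import json
from collections import defaultdict


def group_models_by_tag(manifest_path):
    """Parse dbt's manifest.json and bucket model names by tag."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    groups = defaultdict(list)
    for node in manifest["nodes"].values():
        # Only models are scheduled; seeds, tests etc. are handled elsewhere.
        if node.get("resource_type") != "model":
            continue
        for tag in node.get("tags", []):
            groups[tag].append(node["name"])
    return dict(groups)
```

Each group can then be turned into one task per model inside the matching DAG, e.g. a BashOperator whose command is `dbt run --models <name>` for every name under the snapshot tag.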
Generally, this is just a convention; to make it work, you will need alignment between your DBT model definitions and the DAG file(s) logic. Otherwise it is very easy to break the process. For this reason, it is important to be as descriptive as possible and maintain good documentation.
Running the models
As soon as the services are up (remember:
$ docker-compose up), you can navigate to http://localhost:8000/admin/ and you will see the DAGs below (Off by default):
They are numbered in order of execution. You will need to go step by step: activate each one AND then trigger it. It is suggested to let one DAG finish before activating the next one. The DAGs (in this demo) are not aware of each other, so you might end up trying to use data that has not been loaded yet! To run a DAG, first switch it On and then click “Trigger DAG”.
If everything goes well, the tasks will start getting executed. The DAG “1_load_initial_data” will perform the necessary initialisations and data loading operations. Those are needed for DBT later on.
Furthermore, you can view the GANTT chart of the above operations and identify long running tasks.
When it comes to running “4_daily_dbt_models”, all you have to do is switch it On. The DAG is on a @daily schedule and has a start_date of datetime(2019, 1, 1). As soon as it is activated, it will automatically trigger the historic runs so the data catches up. Furthermore, you can monitor the time required to execute each daily task.
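The daily selection can be sketched as a small helper that builds the command a single daily task would execute. Passing the run’s logical date through --vars is one way to let incremental models filter source rows for that day; the `run_date` variable name is an assumption for illustration, not necessarily what the repository uses:

```python
def daily_dbt_command(execution_date, tag="daily"):
    """Build the dbt invocation for one daily Airflow task.

    Airflow supplies each run's logical date; forwarding it via --vars
    lets incremental models select only that day's rows.
    """
    day = execution_date.strftime("%Y-%m-%d")
    # Doubled braces produce a literal JSON object in the f-string.
    return f'dbt run --models tag:{tag} --vars \'{{"run_date": "{day}"}}\''
```

With catchup enabled, Airflow would call this once per day between the start_date and now, producing one `dbt run` per historic day.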
Finally, you can always connect to the DB (Adminer UI) and view the models’ output. Adminer is a simple and quick client.
Once you are done, you can deactivate the services (
$ docker-compose down) and remove the associated containers & volumes (
$ docker-compose rm).
FYI: in case you work on macOS, be aware of an existing Docker bug that does not fully remove containers and volumes. Quick fix:
$ docker run --privileged --pid=host docker/desktop-reclaim-space
Hopefully, this was a good practical example. Combining Apache Airflow and DBT is not trivial, but Docker Compose makes the process more structured and streamlined. An alternative approach is to parse the raw SQL within manifest.json; that eliminates the need to use the bash operator to execute
dbt run.. commands (probably for the next iteration of the repository).
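That alternative could start from something like the sketch below: pulling each model’s raw SQL out of manifest.json so a database operator could run it directly (the function name is mine, and note that raw SQL still contains unrendered Jinja, so in practice the compiled output of dbt compile would be needed):

```python
import json


def model_sql_by_name(manifest_path):
    """Map each dbt model name to its raw SQL from manifest.json."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {
        node["name"]: node.get("raw_sql", "")
        for node in manifest["nodes"].values()
        if node.get("resource_type") == "model"
    }
```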
- If you have access to Adobe Analytics Clickstream data and Google BigQuery: https://analyticsmayhem.com/dbt/schedule-dbt-models-with-apache-airflow/