Working with dbt certainly makes an analyst's life much better. It brings a structure that is difficult to achieve without a framework. One aspect, however, that I find lacking in dbt is monitoring the ongoing execution and performance of models. This can be substantially improved by scheduling dbt models with Apache Airflow.
UPDATE May 16th, 2020: This guide focuses on using Adobe Analytics Clickstream data on BigQuery. If you do not have access to either but still want to explore Airflow and dbt, check out my latest guide (https://analyticsmayhem.com/dbt/apache-airflow-dbt-docker-compose/). It utilises a public Instacart dataset offered by Kaggle and runs on Docker Compose.
What is Apache Airflow
Apache Airflow is an open-source workflow management platform. It was built at Airbnb around 2014, was later open-sourced, and gradually found its way into multiple teams and companies. It can be used not just to automate and schedule ETL jobs; it is a general-purpose workflow management tool.
Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python and then Airflow manages the scheduling and execution. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers. (Source: https://en.wikipedia.org/wiki/Apache_Airflow)
A workflow is defined programmatically, with all the pros and cons that come with that.
As soon as Airflow is initiated, the above sample (demo code) is translated into the following tree view.
How Airflow assists dbt
As mentioned in my previous posts (e.g. https://analyticsmayhem.com/adobe-analytics/model-data-bigquery-dbt/), in dbt you can monitor the sequential execution of each individual model, observe errors and check how long each model takes. After all, dbt is based on DAGs to establish the hierarchy of models.
But if you have multiple models running (daily, weekly etc.) you are not able to:
- monitor the performance of each model individually and identify degradations in the execution time
- easily review task execution status for all past tasks
- reschedule past task executions for multiple time windows
What is – or looks – common
One of the greatest features of dbt is the documentation site that it automatically generates based on your SQL models. Within it, there is a lineage graph that helps easily navigate through the nodes and understand hierarchy.
In Airflow, every single task is translated into a node of the DAG. In order to convert the dbt DAG into an Airflow DAG, we need to reconstruct it manually. Thankfully, as soon as the dbt models are compiled, the model dependencies are exported in manifest.json. Then it is a matter of parsing the file properly and building the workflow (sample code). Eventually we can reach the equivalent state in Airflow.
Why try Airflow
First, you can monitor the status of each individual task in a nice and organised manner. Furthermore, you can rerun past tasks – though this requires implementing Airflow using Kubernetes or Celery (a topic for another post).
On top of that, Airflow offers additional charts (even though admittedly not perfect) to monitor your dbt models. The Gantt chart is helpful for assessing each model within a particular time frame. Below – as expected – calculating daily visits from Adobe Analytics Data Feeds is much more time-consuming than counting individual page views. This is because page views are a matter of counting rows, while visits require a combination of COUNT/DISTINCT and CONCAT functions.
Furthermore, the daily trend is helpful for watching out for changes in the processing pattern and spotting spikes or drops.
GitHub repo sample
In order to assess the above, I have built a POC version through Docker.
However, it relies on certain assumptions in order to make the process work end-to-end:
- Adobe Analytics Clickstream data is already loaded into BigQuery (dataset name: adobe_analytics, table: click_stream).
- The dbt models used run on Google BigQuery – sample repository: https://github.com/konosp/adobe-clickstream-dbt.git
- Airflow runs in a Docker container. Within the container, the dbt profile.yml and the GCP service account credentials are copied. The former is needed to compile dbt. The latter is needed to run the models on GCP.
- The container clones the dbt repository above and compiles the dbt models to obtain the node hierarchy.
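For context, a dbt profile targeting BigQuery with a service-account keyfile typically looks like the sketch below. The profile name, GCP project id, and keyfile path are placeholders of mine; the repository's profile-demo.yml is the authoritative version.

```yaml
adobe_clickstream:        # profile name; must match the profile set in dbt_project.yml
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      project: my-gcp-project                       # placeholder GCP project id
      dataset: adobe_analytics                      # dataset the models expect
      keyfile: /root/.dbt/service_account_key.json  # assumed path inside the container
      threads: 4
```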
In order to run the process fully, you will need to:
- Make sure you have a dataset named adobe_analytics and a table click_stream with the default column names.
- Fork the Airflow repository https://github.com/konosp/dbt-on-airflow.
- Go through the README to update the profile-demo.yml and service_account_key.json files accordingly. The files are located at https://github.com/konosp/dbt-on-airflow/tree/master/misc
- Build and run Airflow through Docker: docker build -t dbt-airflow . && docker run -it --rm -p 8080:8080 dbt-airflow https://github.com/konosp/adobe-clickstream-dbt.git. If you forked the repository, you can simply replace the URL above with your own repo.
Let me be clear: the above process does significantly complicate the deployment of dbt. The pros of Airflow do not necessarily outweigh all the complexities introduced. However, this is an option that offers some extra flexibility and utility (at a cost).
Furthermore, I have not even touched on the Kubernetes-based deployment for dynamic task re-execution.
Related resources
- Apache Airflow and DBT using an eCommerce Kaggle Dataset: https://analyticsmayhem.com/dbt/apache-airflow-dbt-docker-compose/
- Adobe Analytics Clickstream & DBT: https://analyticsmayhem.com/adobe-analytics/clickstream/bigquery-data-models-dbt-sql/
- Airflow documentation: https://airflow.apache.org/docs/stable/
- Article by AirBnB: https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8