Adobe SQL Models with dbt

Adobe Analytics Clickstream data with dbt & SQL – Guide

I have previously elaborated on both the usefulness of working with raw clickstream data and how powerful I consider dbt (data build tool) to be for managing SQL models. This time I will take a more practical approach and go through the basic setup to build some initial SQL models in BigQuery based on the Adobe Analytics Clickstream data. The topics to cover are:

  • Loading Adobe Analytics data
  • Setup
  • Compiling SQL files
  • Running dbt
  • Basic Metrics

Loading Adobe Analytics data

Before we get started, you need to have Clickstream data loaded in a BigQuery dataset. If you have not already done so, you can find more information about the topic at Load Data Feeds to Google BigQuery to understand the key steps and configurations of the process.

Furthermore, if you are not a Project Admin in the GCP project, make sure you have the following permissions (they are needed to execute dbt):

  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User

Setup

First you need to install the library; you have a few options available (https://docs.getdbt.com/docs/installation). For the remainder of this guide, I have used pip as the installation method on macOS. After you have installed it, clone the repository from https://github.com/konosp/adobe-clickstream-dbt and navigate into adobe-clickstream-dbt. The path models/introduction contains 5 SQL files; those are dynamic files based on Jinja, a templating language.

Let's examine the first one, clickstream_full.sql. You will notice it looks like a regular SQL file that queries a typical Data Feed file from Adobe Analytics. The key point is that anything between double curly brackets gets parsed by Jinja. config captures model-specific configurations, i.e. the type of materialisation (view/table/incremental/ephemeral). var entries are dynamic variables (see below) and ref entries are references to other models. References help dbt assess the model hierarchy and the appropriate order of execution.

Sample model
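
To make the structure concrete, a model of this shape might look roughly like the following. This is a minimal sketch rather than the exact repository file: the column names follow the standard Adobe Analytics Data Feed schema, and the source table is assembled from the project variables.

```sql
-- Sketch of a root model in the style of clickstream_full.sql.
-- config() sets the materialisation; var() pulls values from dbt_project.yml.
{{ config(materialized='ephemeral') }}

select
    date_time,
    post_visid_high,
    post_visid_low,
    visit_num,
    post_pagename,
    post_page_event,
    exclude_hit,
    hit_source
from `{{ var("dataset") }}.{{ var("click_stream_table") }}`
-- date_time in Data Feeds is a 'YYYY-MM-DD HH:MM:SS' string, so cast it
-- before comparing against the configured date range
where date(timestamp(date_time))
    between date('{{ var("start_date") }}') and date('{{ var("end_date") }}')
```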

Compiling SQL files

A similar structure exists in the rest of the models. Now execute dbt compile in the terminal. If the output is similar to the one below, the library installation completed successfully and the models are discoverable.

dbt compile output

However, this does not mean the queries have been executed on BigQuery. dbt has just compiled our parametrised models into executable SQL files. The compiled files are under target/compiled/core/introduction. If you inspect any of the files, you will notice that their syntax is valid SQL.

Compiled SQL

But where are those variables defined? The answer is in the dbt_project.yml file. The repository already contains this file. However, when you initiate a new project through dbt init (not needed if you clone the repository), this file is created automatically with a basic sample structure. The configuration shown below instructs all the models placed under the introduction folder to:

  • Be materialised as tables (unless a different config is set within the SQL file).
  • Have the variables dataset, click_stream_table, start_date and end_date available, set to the values defined in the file.
dbt_project.yml
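
For reference, the relevant section of such a dbt_project.yml looks roughly like this. The values are illustrative placeholders rather than the repository's actual settings, and note that newer dbt versions expect vars at the top level of the file instead of nested under models.

```yaml
# Sketch of the relevant dbt_project.yml section (values are illustrative).
models:
  core:                      # project name, matching target/compiled/core/...
    introduction:
      materialized: table    # default for all models in models/introduction
      vars:
        dataset: 'adobe_clickstream'      # BigQuery dataset holding the feed
        click_stream_table: 'events'      # table with the raw Data Feed rows
        start_date: '2019-01-01'
        end_date: '2019-12-31'
```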

There are multiple ways to organise the settings and a read through the documentation would give you a much better idea of your options.

Before you proceed with executing SQL on BigQuery, you will need to set up your profiles.yml file (https://docs.getdbt.com/docs/configure-your-profile#section-getting-started). It handles the connection settings with BigQuery; the repository contains a sample file (profile-sample.yml). You will need to copy the file to the correct path – on macOS that is ~/.dbt/profiles.yml.

profile-sample.yml
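
A BigQuery profile using OAuth looks roughly like the following; treat it as a sketch with illustrative values. The profile name must match the profile referenced in dbt_project.yml, and project/dataset must point to your own GCP resources.

```yaml
# Sketch of ~/.dbt/profiles.yml for BigQuery with OAuth (values illustrative).
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth                 # relies on gcloud application credentials
      project: my-gcp-project       # your GCP project id
      dataset: adobe_clickstream    # dataset where dbt builds the models
      threads: 1
```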

Running dbt

How do you actually run the models? Easy: dbt run! Of course there is a wide list of options to use, but for a basic demonstration this will suffice. If everything executes smoothly (authentication, datasets in place, permissions, valid Standard SQL etc.), you should see a similar output.

dbt run output

Note: the sample profiles.yml is configured to use OAuth authentication, so for the execution to work make sure you have the Google Cloud SDK installed and authenticated.

All models have executed, the tables are built and you have a nice overview of the results. Note: the dataset configured in dbt_project.yml needs to already exist in your GCP project.
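
If the dataset is missing, you can create it up front with a one-off DDL statement in the BigQuery console; the project and dataset names below are illustrative and must match your profiles.yml settings.

```sql
-- One-off: create the dataset dbt will build into (names illustrative).
create schema if not exists `my-gcp-project.adobe_clickstream`;
```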

BigQuery Tables

Only 4 tables were created. The reason the clickstream_full table is missing is the configuration {{ config(materialized='ephemeral') }} within the clickstream_full.sql file, which overwrites any setting in dbt_project.yml. An ephemeral model is not built in the database at all; instead, each model that references clickstream_full gets its SQL inlined as a common table expression (CTE). So no table/view is created for it.

Ephemeral Materialisation
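
To make the effect concrete, the compiled SQL of a model that references clickstream_full contains no table reference to it at all; the ephemeral model's query is injected as a CTE, roughly in the following shape (the CTE name dbt generates varies between versions, and the table name is illustrative).

```sql
-- Rough shape of the compiled SQL for a model referencing the
-- ephemeral clickstream_full (CTE and table names illustrative).
with clickstream_full as (
    select date_time, post_pagename, post_page_event, exclude_hit, hit_source
    from `adobe_clickstream.events`
)
select *
from clickstream_full
where exclude_hit = '0'
```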

Basic Metrics

Now that everything works, let's go through the process of calculating some basic metrics. First we need to filter out all hits that should be excluded from reporting (e.g. bots, IP filtering, truncated hits) or originate from non-web hits (e.g. data sources). To do so, within clickstream_clean.sql we set the following WHERE clause (documentation), with the full model sketched below:

  • exclude_hit = '0' AND hit_source NOT IN ('5','7','8','9')
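
Put together, the cleaning model is roughly the following sketch (the repository file is the source of truth; the hit_source values excluded here correspond to non-web uploads such as data sources, per the Adobe documentation).

```sql
-- Sketch of clickstream_clean.sql: keep only reportable, web-sourced hits.
-- exclude_hit = '0' keeps hits not flagged for exclusion (bots, IP filters,
-- truncated hits); hit_source 5/7/8/9 mark non-web data such as data sources.
select *
from {{ ref('clickstream_full') }}
where exclude_hit = '0'
  and hit_source not in ('5', '7', '8', '9')
```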

Next we calculate two different daily metrics, page views and visits (I defined them separately on purpose; explanation below). Each file references clickstream_clean using the ref function.

Daily Page Views model
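
The page views model can then be as simple as the following sketch; in Adobe Data Feeds a hit with post_page_event = '0' is a page view, and the day is derived from the date_time string.

```sql
-- Sketch of daily_page_views.sql: count page view hits per day.
select
    date(timestamp(date_time)) as event_date,
    count(*) as page_views
from {{ ref('clickstream_clean') }}
where post_page_event = '0'   -- '0' marks a page view hit in Data Feeds
group by event_date
order by event_date
```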

The repository contains the exact files for both metrics. However, remember that their date range depends on the values (start_date & end_date) configured in dbt_project.yml.
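
Visits need more care than page views, which is presumably why the two metrics live in separate models: a visit is uniquely identified by the combination of post_visid_high, post_visid_low and visit_num, so the model has to count distinct combinations rather than rows. A sketch:

```sql
-- Sketch of daily_visits.sql: a visit is identified by the combination of
-- post_visid_high, post_visid_low and visit_num, hence the distinct count.
select
    date(timestamp(date_time)) as event_date,
    count(distinct concat(post_visid_high, '-', post_visid_low, '-', visit_num)) as visits
from {{ ref('clickstream_clean') }}
group by event_date
order by event_date
```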

The best way to understand the different dependencies is through the documentation generated by dbt (dbt docs generate, then dbt docs serve to browse it). There it becomes obvious that clickstream_clean depends on the root model clickstream_full (the ephemeral model), and that its children are daily_page_views and daily_visits. If the clean dataset definition is modified (i.e. the SQL is updated), all its children will have to be updated too (hint: this is managed magnificently by dbt!).

Model Graph – derived from the usage of the ref() function.

This gives just an idea of how you can structure your data workspace and start working with clickstream data in a much better fashion.

Epilogue

Utilising SQL models with dbt makes the life of an analyst working with Adobe Analytics Clickstream data much easier and more organised. There is huge potential and room for improvement in our daily analysis by taking advantage of such tools!

In the future I will be enhancing the models with more complicated and advanced cases! Let me know if you have any comments or thoughts!

PS: All code samples are at https://github.com/konosp/adobe-clickstream-dbt