Incremental dbt models using Adobe Analytics Clickstream

In the previous article (Adobe Analytics Clickstream data with dbt & SQL) I discussed how to start with dbt and build basic models using Adobe Analytics Clickstream data. This time we will dive into a more useful feature: creating incremental dbt models using the same Adobe Analytics data. The motivation behind this is to make your data processing pipeline robust enough to:

  • handle re-uploaded files with modified data. Data Feed exports might fail or be delayed, and you need to work on that assumption; being able to cater for delayed data delivery is important
  • avoid double-counting. Sometimes mistakes happen and the same data might be ingested more than once. It is better to prepare for that scenario and avoid duplicate entries
  • be idempotent. Every time a data processing step executes for a particular time frame, it should generate the same results. That way you can break the pipeline execution down into daily steps; no matter how many times you run the step for a particular day, the end result in the data should always be the same

The above assumes you already have a process to load your Data Feeds into a base table. If you want to set up a data loading process in BigQuery, feel free to reach out at info@analyticsmayhem.com! Let's get into it!

Merge Statement

Incremental models in dbt rely on the MERGE statement. For the purposes of this article, I am using BigQuery (MERGE documentation). MERGE is a DML (Data Manipulation Language) statement that allows you to update, insert and delete data in a BigQuery table.

Even though MERGE can perform multiple kinds of operations, dbt uses it in a narrower way. The samples below will help you better understand the actual process. "createTable.sql" will create a table with a single entry and "merge.sql" will insert 2 new rows and modify the existing one based on the unique id.

The result is that the row with id = 'a' will have its value updated from 1 to 2.

[Image: Merge on column "id" as the unique id]
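Since the actual files are shown as images, here is a minimal, hedged sketch of the same idea in BigQuery SQL (table and column names are illustrative, not the exact repository samples):

    -- createTable.sql equivalent: start with a single row (id = 'a', value = 1)
    CREATE OR REPLACE TABLE demo.merge_target AS
    SELECT 'a' AS id, 1 AS value;

    -- merge.sql equivalent: update the existing row and insert the new ones
    MERGE demo.merge_target AS destination_table
    USING (
      SELECT 'a' AS id, 2 AS value UNION ALL
      SELECT 'b' AS id, 3 AS value UNION ALL
      SELECT 'c' AS id, 4 AS value
    ) AS source_table
    ON destination_table.id = source_table.id
    WHEN MATCHED THEN
      UPDATE SET value = source_table.value
    WHEN NOT MATCHED THEN
      INSERT (id, value) VALUES (source_table.id, source_table.value);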

In short, if we try to re-run a model as incremental, existing rows will be updated if needed and new ones will be inserted.

How to implement an incremental model

When you configure a dbt model as incremental, you can specify whether to insert or modify rows based on specific conditions, usually date/time constraints. Let's examine the basic modifications to the models from the previous article.

We will need a column that will serve as a unique identifier. Adobe provides the columns hitid_low and hitid_high in the Data Feeds; their concatenation generates a unique hit identifier. Let's add this to our root clickstream model.

[Image: Generating unique_hit_id]
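A hedged sketch of that addition (the source table reference is illustrative; the exact model is in the repository):

    SELECT
      *,
      -- hitid_high and hitid_low together identify a hit uniquely
      CONCAT(CAST(hitid_high AS STRING), CAST(hitid_low AS STRING)) AS unique_hit_id
    FROM `{{ var("dataset") }}.{{ var("click_stream_table") }}`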

Next we need to modify the clean and the discarded models accordingly. The additions below (lines 3 and 4) change the models to incremental, using the column unique_hit_id as the unique key.

[Image: Model materialisation set to incremental]
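In dbt terms those two lines boil down to the config block at the top of the model; a minimal sketch (not the exact repository code):

    {{ config(
        materialized = 'incremental',
        unique_key = 'unique_hit_id'
    ) }}

The unique_key setting is what tells dbt which column to match on when merging new rows into the existing table.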

In practice this is the equivalent of the clause destination_table.id = source_table.id (merge.sql, line 11) from the SQL samples.

We will also need to apply a condition to filter the rows. Since we are working with tables partitioned on the date_time column, we need to filter the rows (and decrease processing costs). To do so we will use a macro. Macros allow us to re-use code and enforce the DRY principle. Injecting a macro is easy!

[Image: incremental_filter() macro placement]
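The placement itself is just a call inside the WHERE clause of the model; a hedged sketch of how the clean model might use it (the ref() target name is illustrative):

    SELECT *
    FROM {{ ref('clickstream') }}
    WHERE exclude_hit = '0'
      AND hit_source NOT IN ('5', '7', '8', '9')
      AND {{ incremental_filter() }}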

This macro will filter rows based on the start_date and end_date of our execution period. We will define these parameters later, at run time.

[Image: incremental_filter() macro definition]
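A minimal version of such a macro, assuming the date_time column and the two project variables (depending on the column type you may need to cast the values to TIMESTAMP or DATE):

    {% macro incremental_filter() %}
        date_time >= '{{ var("start_date") }}'
        AND date_time < '{{ var("end_date") }}'
    {% endmacro %}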

In a simpler example, this would be the behaviour of the macro.

[Image: Macro result]
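For instance, with start_date set to 2020-02-15 and end_date set to 2020-02-16, the call above would render roughly to:

    date_time >= '2020-02-15' AND date_time < '2020-02-16'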

Running incremental models

How do we actually run all this? Previously we defined two variables in the dbt_project.yml file: start_date and end_date. Those are referenced in the incremental_filter() macro.

[Image: dbt_project.yml variables]
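For reference, a hedged sketch of how the variables can be declared (in recent dbt versions vars is a top-level key in dbt_project.yml; older versions nested it under the models block):

    # dbt_project.yml (excerpt) - default values, overridable at run time
    vars:
      start_date: "2020-02-15"
      end_date: "2020-02-16"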

We can overwrite their values during dbt execution using the --vars argument. This way we can programmatically control which dates the process runs for, or we can build a pipeline that runs daily and processes only data from the previous day.

dbt run --models clickstream_clean --vars '{"start_date":"2020-02-15","end_date":"2020-02-16"}'

Final notes

  • Running dbt in full-refresh mode will essentially drop the destination table and recreate it.
  • You can create any kind of condition when applying the incremental filtering. In the most basic examples, the only check is that the event_time of the incoming rows is greater than the maximum value of the existing rows.
  • If you had to re-upload data for a particular period in your clickstream table, you would need to update all downstream dependent models. To do so, run dbt run --models clickstream+ --vars '{"start_date":"<period start>","end_date":"<period end>"}'. Notice the plus sign "+" after clickstream!

You can see the changes from the previous article on this pull request: https://github.com/konosp/adobe-clickstream-dbt/pull/2/files and the full repository at https://github.com/konosp/adobe-clickstream-dbt

Adobe Analytics Clickstream data with dbt & SQL – Guide

I have previously elaborated on both the usefulness of working with raw clickstream data and how powerful I consider dbt (data build tool) to be in managing SQL models. This time I will take a more practical approach and go through the basic setup to build some initial SQL models in BigQuery based on Adobe Analytics Clickstream data. The topics to cover are:

  • Loading Adobe Analytics data
  • Setup
  • Compiling SQL files
  • Running dbt
  • Basic Metrics

Loading Adobe Analytics data

Before we get started, you need to have Clickstream data loaded in a BigQuery dataset. If you have not already done so, you can find more information about the topic at Load Data Feeds to Google BigQuery to understand the key steps and configurations of the process.

Furthermore, if you are not a Project Admin in the GCP project, make sure you have the following permissions (they are needed to execute dbt):

  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User

Setup

First you need to install the library; you have a few options available (https://docs.getdbt.com/docs/installation). For the remainder of this guide, I have used pip as the installation method on macOS. After you have installed it, clone the repository from https://github.com/konosp/adobe-clickstream-dbt and navigate into adobe-clickstream-dbt. The path models/introduction contains 5 SQL files; those are dynamic files based on Jinja, a templating language.

Let's examine the first one: clickstream_full.sql. You will notice it looks like a regular SQL file that queries a typical Data Feed file from Adobe Analytics. The key point is that anything between double curly brackets gets parsed by Jinja. config captures model-specific configuration, i.e. the type of materialisation (view/table/ephemeral), var references dynamic variables (see below) and ref creates references to other models. References help dbt assess the model hierarchy and the appropriate order of execution.

[Image: Sample model]
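To give an indication of its shape, here is a hedged sketch (not the exact file) combining the config and var pieces just described:

    {{ config(materialized='ephemeral') }}

    SELECT *
    FROM `{{ var("dataset") }}.{{ var("click_stream_table") }}`
    WHERE date_time BETWEEN '{{ var("start_date") }}' AND '{{ var("end_date") }}'

A downstream model would then pull from it with SELECT * FROM {{ ref('clickstream_full') }}, which is how dbt learns about the dependency.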

Compiling SQL files

A similar structure exists in the rest of the models. Now execute dbt compile in the terminal. If the output is similar to the below, the library installation has completed successfully and the models are discoverable.

[Image: dbt compile output]

However this does not mean the queries have been executed on BigQuery. dbt has just compiled our parametrised models into executable SQL files. The compiled files are under target/compiled/core/introduction. If you inspect any of the files, you will notice that their syntax is valid SQL.

[Image: Compiled SQL]

But where are those variables located? The answer is in the dbt_project.yml file. The repository already contains this file; however, when you initiate a new project through dbt init (not needed if you clone the repository), this file is created automatically with a basic sample structure. The configuration shown below instructs all the models placed under the introduction folder to:

  • Be materialised as tables (unless a different config is set within the SQL file).
  • Have the variables dataset, click_stream_table, start_date and end_date available, set to the values shown in the file below.
[Image: dbt_project.yml]
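A hedged sketch of that configuration (the project is named core, as the compiled path suggests; variable values are placeholders):

    # dbt_project.yml (excerpt)
    models:
      core:
        introduction:
          materialized: table   # newer dbt versions use +materialized

    vars:
      dataset: "my_dataset"              # placeholder
      click_stream_table: "hit_data"     # placeholder
      start_date: "2020-02-15"
      end_date: "2020-02-16"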

There are multiple ways to organise the settings and a read through the documentation would give you a much better idea of your options.

Before you proceed with executing SQL on BigQuery, you will need to set up your profiles.yml file (https://docs.getdbt.com/docs/configure-your-profile#section-getting-started). It handles the connection settings for BigQuery; the repository contains a sample file (profile-sample.yml). You will need to copy the file to the correct path, which on macOS is ~/.dbt/profiles.yml.

[Image: profile-sample.yml]
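A minimal BigQuery profile using oauth looks roughly like this (the profile name must match the profile entry in dbt_project.yml; project and dataset are placeholders):

    # ~/.dbt/profiles.yml
    default:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: my-gcp-project      # placeholder GCP project ID
          dataset: adobe_clickstream   # placeholder BigQuery dataset
          threads: 1
          location: US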

Running dbt

How do you actually run the models? Easy: dbt run! Of course there is a wide list of options to use, but for a basic demonstration the above will suffice. If everything executes smoothly (authentication, datasets in place, permissions, valid Standard SQL etc.), you should see a similar output.

[Image: dbt run output]

Note: in order for the execution to work, profiles.yml is configured to use oauth authentication. Make sure you have the Google Cloud SDK installed.

All models have executed, the tables are built and you have a nice overview of the results. Note: the dataset configured in dbt_project.yml needs to be present in your GCP project.

[Image: BigQuery tables]

Only 4 tables were created. The clickstream_full table is missing due to the configuration {{ config(materialized='ephemeral') }} within the clickstream_full.sql file, which overrides any setting in dbt_project.yml. This means each model that references clickstream_full gets its SQL interpolated as a common table expression, so no table or view is created for it.

[Image: Ephemeral materialisation]

Basic Metrics

Now that everything works, let's go through the process of calculating some basic metrics. First we need to filter out all hits that should be excluded from reporting (e.g. bots, IP filtering, truncated hits) or that originate from non-web hits (e.g. data sources). To do so, within clickstream_clean.sql we set the following WHERE conditions (documentation):

  • exclude_hit = '0' AND hit_source NOT IN ('5', '7', '8', '9')
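Put together, the filtering part of clickstream_clean.sql looks roughly like this (a sketch; the exact file is in the repository):

    SELECT *
    FROM {{ ref('clickstream_full') }}
    WHERE exclude_hit = '0'
      AND hit_source NOT IN ('5', '7', '8', '9')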

Next we calculate two different daily metrics, page views and visits (I defined them separately on purpose; more on that below). Each file references clickstream_clean using the ref function.

[Image: Daily Page Views model]
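A hedged sketch of the daily page views model; treating hits with a populated post_pagename as page views is an assumption here, not necessarily what the repository file does:

    SELECT
      DATE(date_time) AS hit_date,
      COUNT(*) AS page_views
    FROM {{ ref('clickstream_clean') }}
    WHERE post_pagename IS NOT NULL
      AND post_pagename != ''      -- assumption: page view hits carry a page name
    GROUP BY 1
    ORDER BY 1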

The repository contains the exact files for both metrics. However remember that their date range depends on the values (start_date & end_date) configured in dbt_project.yml.

The best way to understand the different dependencies is through the documentation generated by dbt. Now it is obvious that clickstream_clean depends on the root model clickstream_full (the ephemeral model) and that its children are daily_page_views and daily_visits. If the clean dataset definition is modified (i.e. the SQL is updated), all its children will have to be updated (hint: this is managed magnificently by dbt!).

[Image: Model graph, derived from the usage of the ref() function]

This is just an idea of how you can structure your data workspace and start working with clickstream data in a much better fashion.

Epilogue

Utilising SQL models with dbt makes the life of an analyst working through Adobe Analytics Clickstream data much easier and more organised. There is huge potential and room for improvement in our daily analysis by taking advantage of such tools!

In the future I will be enhancing the models with more complicated and advanced cases! Let me know if you have any comments or thoughts!

PS: All code samples are at https://github.com/konosp/adobe-clickstream-dbt

Model Adobe Analytics Data in BigQuery with dbt – Intro

Working with raw Adobe Analytics data is always both interesting and challenging. One of the reasons is that Adobe (and all analytics vendors) have trained us to have easy access to multiple reports and metrics. However, those rely on an enormous number of hours spent by some very intelligent people building those products. So when we try to replicate a lot of those reports using SQL, it is only natural to, sometimes, get discouraged. Modelling such a data set and building a manageable environment for analysis and reporting outside of the Adobe suite can be tricky. Is there any way to model Adobe Analytics data in BigQuery?

When I discovered dbt (https://www.getdbt.com) I was very excited. To get an initial idea:

dbt (data build tool) is a command line tool that enables data analysts and engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.

dbt helps do the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

From dbt documentation

The above is really an understatement of what it offers:

  • Organise all your models’ logic in small SELECT statements along with their related dependencies
  • Deliver production workflows for incremental data updates based on simple SQL
  • Version control and share your work
  • Implement testing to ensure validity and integrity of your work
  • A great documentation repository

The basics of dbt

Explaining what dbt is might be harder than actually demonstrating it. To put it in words, you can define all your business logic in simple SELECT statements. Each statement goes in a separate SQL file for greater modularity.

[Image: List of SQL models]

However, each statement might (and most likely will) depend on other models. So within each SQL file you specify the dependencies. Usually, if you want to write a SELECT statement, you specify the source of your data in the FROM clause. In dbt, you do the same but reference the model name instead.

[Image: Simple model reference]

This helps dbt understand the dependency tree and build tables in the appropriate order. So no matter how complicated your logic is, good governance can be enforced. After you have coded your business logic in different models, dbt takes care of the execution. At the same time, it manages errors and notifies you about them.

[Image: Multiple models execution]

As soon as you take care of them (or just want to modify the SQL statement), you can re-run only the updated model.

[Image: Single model execution]

So far we have not specified where the models live. In this demo, I am using Google BigQuery. The above creates a table (or view) in BigQuery based on the model defined within the "multi_product_purchasers_top_10.sql" file.

[Image: Materialised BigQuery table]

Getting your hands dirty with dbt

Now that we have a basic understanding of dbt, let's see what else is possible.

  1. Multiple database/data warehouse options. Integrations with all major modern data warehouse solutions are offered; BigQuery, Redshift, Snowflake. Also support for Postgres and Spark is available.
  2. Different materialisation options (views, tables).
  3. Incremental updates. If your process runs daily, only new/filtered data can be processed and appended to your tables.
  4. Production versus Development execution. Distinction between development and production environments helps you deploy only high quality models in production reports while you still work locally on experimental tables.
  5. Testing models. Automated QA process on the different tables’ attributes (uniqueness of values, non-null etc.).
  6. Programming-oriented features (i.e. loops, if statements, macros). Create intelligent and re-usable logic!
  7. Full documentation and knowledge governance features – see below.

More than a data build tool

One of the biggest issues in complicated data environments is maintaining accurate and clear documentation. Helping data analysts understand where each table derives from and what kind of columns exist, but also letting them search for available tables based on keywords, is a huge advantage. How many times were you reviewing a table but had no idea what the columns meant or where the data came from?

With the native documentation functionality of dbt, you can easily manage the above issues. The best aspect however is using version control to enforce that only tables with full documentation are pushed to production. For example, below is the graph for the models previously described. Understanding relationships was never easier!

[Image: Model graph]

Furthermore, searching through available tables and/or columns is fast and easy!

[Image: Documentation search]

Finally, as part of the code, table descriptions can be compiled and be part of the project documentation.

[Image: Documentation]
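In dbt these descriptions live in YAML property files next to the models; a hedged sketch for one of the models mentioned earlier:

    # models/schema.yml (sketch)
    version: 2

    models:
      - name: multi_product_purchasers_top_10
        description: "Top visitors ranked by distinct purchased products (illustrative description)."
        columns:
          - name: visitor_id
            description: "Adobe visitor identifier (placeholder description)."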

Starting with Adobe Analytics

So let's see how the knowledge above can be applied to Adobe Analytics data. We have already demonstrated how to import and process Adobe Analytics data in BigQuery. If you missed those posts, check https://analyticsmayhem.com/tag/bigquery/.

One of the most common problems is proper visitor identification. Usually our visitors log in mid-way through the journey and we store their customer ID in a traffic or conversion variable. With the available configuration, data cannot be backfilled to attribute success metrics that occurred prior to the authentication of the customer.

Let's see how this can be solved with dbt.

First we need to define a visitor mapping where each customer ID is associated with a visitor ID. This is our first model.

[Image: Visitor ID mapping model]
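A hedged sketch of such a mapping model, assuming the customer ID is stored in an eVar such as post_evar1 and that a cleaned hit-level model exists (both model and column names are illustrative):

    -- visitor_id_mapping.sql (sketch)
    SELECT
      CONCAT(CAST(post_visid_high AS STRING), CAST(post_visid_low AS STRING)) AS visitor_id,
      MAX(post_evar1) AS customer_id   -- assumption: post_evar1 holds the customer ID
    FROM {{ ref('clickstream_clean') }}
    WHERE post_evar1 IS NOT NULL
      AND post_evar1 != ''
    GROUP BY 1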

Next we need to update all records with the customer ID (including those that occurred before the login).

[Image: Joining the Visitor ID mapping with all records]
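And a hedged sketch of the backfilling join, attaching the mapped customer ID to every hit of the same visitor:

    -- clickstream_backfilled.sql (sketch)
    SELECT
      hits.*,
      mapping.customer_id AS backfilled_customer_id
    FROM {{ ref('clickstream_clean') }} AS hits
    LEFT JOIN {{ ref('visitor_id_mapping') }} AS mapping
      ON CONCAT(CAST(hits.post_visid_high AS STRING), CAST(hits.post_visid_low AS STRING)) = mapping.visitor_id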

Let’s create another table to assess the impact of the data backfilling.

[Image: Summary metrics model]

When comparing the ratio of "identified" hits/rows between the original data set and the backfilled one, we observe an improvement in the percentage of hits that are associated with a customer ID. This will depend on the nature of your data, but the main idea is that you can then set up an ongoing process to keep backfilling data.

[Image: Summary metrics results]

Epilogue

Hope you are as excited about all the capabilities that BigQuery along with dbt can offer your analytics team! If you are interested in more Adobe Analytics clickstream posts, please look at https://analyticsmayhem.com/tag/datafeeds/. Otherwise, if you are interested in support and assistance, I am happy to discuss more!