Model Adobe Analytics Data in BigQuery with dbt – Intro

Working with raw Adobe Analytics data is always both interesting and challenging. One reason is that Adobe (like all analytics vendors) has trained us to expect easy access to a multitude of reports and metrics. However, those rely on an enormous number of hours spent by some very intelligent people building the product. So when we try to replicate many of those reports using SQL, it is only natural to – sometimes! – get discouraged. Modelling such a data set and building a manageable environment for analysis and reporting outside of the Adobe suite can be tricky. Is there any way to model Adobe Analytics data in BigQuery?

When I discovered dbt (https://www.getdbt.com) I was very excited. To get an initial idea:

dbt (data build tool) is a command line tool that enables data analysts and engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.

dbt helps do the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

From dbt documentation

The above is really an understatement of what it offers:

  • Organise all your models’ logic in small SELECT statements along with their related dependencies
  • Deliver production workflows for incremental data updates based on simple SQL
  • Version control and share your work
  • Implement testing to ensure validity and integrity of your work
  • Maintain a great documentation repository

The basics of dbt

Explaining what dbt is might be harder than actually demonstrating it. To put it in words, you define all your business logic in simple SELECT statements. Each statement goes in a separate SQL file for greater modularity.

[Image: List of SQL models]
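To make this concrete, here is a minimal sketch of what one of those model files might contain, assuming the raw Adobe Analytics data feed has been loaded into a BigQuery table called adobe_analytics.hit_data (both the model name and the table name are hypothetical, not taken from the screenshot above):

    -- models/visitors_per_day.sql (hypothetical model)
    -- A dbt model is nothing more than a SELECT statement saved in its own file.
    select
        date(timestamp_seconds(cast(hit_time_gmt as int64))) as hit_date,
        count(distinct concat(cast(post_visid_high as string), cast(post_visid_low as string))) as visitors
    from adobe_analytics.hit_data
    group by 1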

However, each statement might (and most likely will) depend on other models, so within each SQL file you specify its dependencies. Normally, when you write a SELECT statement, you specify the source of your data in the FROM clause. In dbt you do the same, but you reference the other model by its file name instead of a hard-coded table.

[Image: Simple model reference]
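In practice, that dependency is declared with dbt's ref() function: the FROM clause points at another model by name rather than a hard-coded table. A minimal sketch, building on the hypothetical visitors_per_day model above:

    -- models/visitors_rolling_7d.sql (hypothetical model)
    select
        hit_date,
        -- 7-day rolling sum of daily visitors
        sum(visitors) over (order by hit_date rows between 6 preceding and current row) as visitors_7d
    from {{ ref('visitors_per_day') }}   -- dbt resolves this reference to the table/view it built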

This helps dbt understand the dependency tree and build tables in the appropriate order, so no matter how complicated your logic is, good governance can be enforced. After you have coded your business logic in the different models, dbt takes care of the execution. At the same time, it manages and notifies you about errors.

[Image: Multiple models execution]
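Executing the whole project is a single command; dbt resolves the ref() calls into a dependency graph and runs the models in order, reporting success or failure per model:

    # build every model in the project, in dependency order
    dbt run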

As soon as you fix them (or simply want to modify a SQL statement), you can re-run only the updated model.

[Image: Single model execution]
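Re-running just the model you changed is also a one-liner, here using the model name discussed in the next paragraph (at the time of writing the selection flag was --models; newer dbt versions use --select):

    # rebuild only the updated model
    dbt run --models multi_product_purchasers_top_10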

So far we have not specified where the models live. In this demo, I am using Google BigQuery. The above creates a table (or view) in BigQuery based on the model defined within the “multi_product_purchasers_top_10.sql” file.

[Image: Materialised BigQuery table]
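Whether a model materialises as a view or a table in BigQuery is controlled by a config() block at the top of the model file (or centrally in dbt_project.yml). A minimal sketch; the SELECT is purely illustrative, and the upstream purchases model and its columns are assumptions:

    -- models/multi_product_purchasers_top_10.sql
    {{ config(materialized='table') }}   -- default is 'view'; 'table' persists the result in BigQuery

    -- the model's SELECT statement follows as usual
    select
        customer_id,
        count(distinct product) as products_purchased
    from {{ ref('purchases') }}   -- hypothetical upstream model
    group by 1
    order by products_purchased desc
    limit 10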

Getting your hands dirty with dbt

Now that we have a basic understanding of dbt, let’s see what else is possible.

  1. Multiple database/data warehouse options. Integrations with all major modern data warehouse solutions are offered: BigQuery, Redshift, Snowflake. Support for Postgres and Spark is also available.
  2. Different materialisation options (views, tables).
  3. Incremental updates. If your process runs daily, only new/filtered data can be processed and appended to your tables (see the sketch after this list).
  4. Production versus Development execution. The distinction between development and production environments helps you deploy only high-quality models to production reports while you keep working locally on experimental tables.
  5. Testing models. Automated QA checks on the different tables’ attributes (uniqueness of values, non-null etc.).
  6. Programming-oriented features (i.e. loops, if statements, macros). Create intelligent and re-usable logic!
  7. Full documentation and knowledge governance features – see below.
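As promised in point 3, here is a minimal sketch of an incremental model, reusing the hypothetical adobe_analytics.hit_data table from earlier. On the first run dbt builds the full table; on every subsequent run the is_incremental() block keeps only rows newer than what is already materialised and appends them:

    -- models/daily_hits.sql (hypothetical incremental model)
    {{ config(materialized='incremental') }}

    select
        date(timestamp_seconds(cast(hit_time_gmt as int64))) as hit_date,
        count(*) as hits
    from adobe_analytics.hit_data
    {% if is_incremental() %}
      -- {{ this }} is the already-built table, so only new days are processed
      where date(timestamp_seconds(cast(hit_time_gmt as int64))) > (select max(hit_date) from {{ this }})
    {% endif %}
    group by 1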

More than a data build tool

One of the biggest issues in complicated data environments is maintaining accurate and clear documentation. Helping data analysts understand where each table derives from and what its columns contain, but also letting them search the available tables by keyword, is a huge advantage. How many times were you reviewing a table but had no idea what the columns meant or where the data came from?

With the native documentation functionality of dbt, you can easily manage the above issues. The best aspect, however, is using version control to enforce that only fully documented tables are pushed to production. For example, below is the graph for the models previously described. Understanding relationships was never easier!

[Image: Model graph]

Furthermore, searching through available tables and/or columns is fast and easy!

[Image: Documentation search]

Finally, table descriptions live alongside the code and are compiled into the project documentation.

[Image: Documentation]
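Those descriptions (and the tests from point 5 above) live in a YAML file next to the models; the documentation site itself is generated with dbt docs generate and browsed locally with dbt docs serve. A minimal sketch, again using the hypothetical visitors_per_day model:

    # models/schema.yml (hypothetical)
    version: 2
    models:
      - name: visitors_per_day
        description: "Daily unique visitor counts derived from the raw Adobe Analytics hit data"
        columns:
          - name: hit_date
            description: "Calendar date of the hits (GMT)"
            tests:
              - unique
              - not_null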

Starting with Adobe Analytics

So let’s see how the knowledge above can be applied to Adobe Analytics data. We have already demonstrated how to import and process Adobe Analytics data in BigQuery. If you missed those posts, check https://analyticsmayhem.com/tag/bigquery/.

One of the most common problems is proper visitor identification. Usually our visitors log in midway through their journey, and we store their customer ID in a traffic or conversion variable. With the configuration available, data cannot be backfilled to attribute success metrics that occurred before the customer authenticated.

Let’s see how this can be solved with dbt.

First we need to define a visitor mapping where each customer ID is associated with a visitor ID. This is our first model.

[Image: Visitor ID mapping]
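A minimal sketch of that first model. It assumes the raw hits sit in the hypothetical adobe_analytics.hit_data table, that the visitor ID is the combination of post_visid_high and post_visid_low, and that the customer ID was captured in post_evar1 (your eVar number will differ):

    -- models/visitor_customer_mapping.sql (hypothetical model)
    {{ config(materialized='table') }}

    -- one row per visitor that authenticated at least once, with the customer ID they used
    select
        concat(cast(post_visid_high as string), '-', cast(post_visid_low as string)) as visitor_id,
        max(post_evar1) as customer_id
    from adobe_analytics.hit_data
    where post_evar1 is not null and post_evar1 != ''
    group by 1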

Next we need to update all records with the customer ID, including those that occurred before the login.

[Image: Joining Visitor ID mapping with all records]
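A minimal sketch of that second model: every hit is joined back to the mapping via ref(), so hits that happened before the login also receive the customer ID:

    -- models/hits_backfilled.sql (hypothetical model)
    {{ config(materialized='table') }}

    select
        h.*,
        coalesce(nullif(h.post_evar1, ''), m.customer_id) as customer_id_backfilled
    from adobe_analytics.hit_data as h
    left join {{ ref('visitor_customer_mapping') }} as m
      on concat(cast(h.post_visid_high as string), '-', cast(h.post_visid_low as string)) = m.visitor_id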

Let’s create another table to assess the impact of the data backfilling.

[Image: Summary metrics model]
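A minimal sketch of the summary model, comparing the share of hits that carry a customer ID before and after the backfill (built on the hypothetical hits_backfilled model above):

    -- models/backfill_summary.sql (hypothetical model)
    select
        count(*) as total_hits,
        countif(nullif(post_evar1, '') is not null) as identified_hits_original,
        countif(customer_id_backfilled is not null) as identified_hits_backfilled,
        round(countif(nullif(post_evar1, '') is not null) / count(*), 4) as identified_ratio_original,
        round(countif(customer_id_backfilled is not null) / count(*), 4) as identified_ratio_backfilled
    from {{ ref('hits_backfilled') }}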

When comparing the ratio of “identified” hits/rows between the original data set and the backfilled one, we observe an improvement in the percentage of hits associated with a customer ID. The exact uplift will depend on the nature of your data, but the main idea is that you can then set up an ongoing process to keep backfilling the data.

[Image: Summary metrics results]

Epilogue

Hope you are as excited as I am about all the capabilities that BigQuery, along with dbt, can offer your Analytics team! If you are interested in more Adobe Analytics clickstream posts, please have a look at https://analyticsmayhem.com/tag/datafeeds/. Otherwise, if you are interested in support and assistance, I am happy to discuss more!