What is Adobe Analytics

In today’s digital world, all companies rely on a digital presence to promote their brand, sell products and engage with their customers. What is Adobe Analytics and how can it help you achieve your business goals?

Let’s get into it! Hopefully after this you will have an initial understanding of Adobe Analytics and Web Analytics in general, and a sense of whether this is a path you want to explore further.

Do you need a Digital Analytics solution?

Let’s first try to understand what a stakeholder in an e-Commerce company needs. A general requirement would be to “maximise the ROI of the digital investment (e-shop, mobile app etc.)”. How this translates to your particular case is subjective. Most of the time it means revenue maximisation, ARPU increase, customer retention, NPS or something similar. Some will argue this information already exists in the CRM system, a transactional database or in the 3rd party that sends surveys. This is true, but by itself it does not actually help to improve those core metrics. How are we supposed to maximise:

  • Revenue if we do not know how customers navigate around our shop?
  • Conversion rate if we do not know what customers did before making (or not making) any purchase?
  • ARPU if we are not aware of what products the customers show interest in over time?
  • Retention and NPS if we do not learn their pain points and frustrations as they browse?

We have to remember that people are actually visiting our brand; they showed intent. We should be trying to make their time with our brand as pleasant and smooth as possible! A noble cause, I would say! So we need rich information about what they do between arriving on our site and the actual sale (if any).

We need to be able to understand the different actions customers took. In which part of their journey do they most commonly abandon our brand? How often do they visit the complaints page after making a purchase? Having this view creates the ground to start asking questions and looking for answers that are otherwise impossible to obtain.

Even if you ask your customers directly, the answers will range from misleading (at best) to outright lies (at worst). Visitors reveal more about our brand through their actions than through what they tell us. This is especially true in sales. Nothing speaks louder than a person reaching for their wallet!

Adobe Analytics

Let’s try to understand what Adobe Analytics is. Analytics is one of the core offerings from Adobe and is part of the Adobe Experience Cloud suite. As a collection of products, the suite offers numerous functionalities, but here we will focus solely on Adobe Analytics.

The main purpose of Analytics is to provide a comprehensive solution that helps you understand web behavioural data. It is an enterprise solution (i.e. features, SLAs, access control, costs) that enables you to collect, store, process and visualise data in an easy-to-manage fashion. It enables you to quickly view current performance and make decisions supported by data.

adobe workspace sample dashboard
Sample dashboard from www.cio.com

If your site has a multi-step funnel, how do you know which step is underperforming? If you do not, chances are you will have to do user research, ask individual users or maybe your own employees. How much this reflects reality is something we will never know. Imagine being able to have a view covering the biggest portion of your visitors; that would allow you to draw safer conclusions.

workspace fallout report
Fallout report – Adobe.com

The only way to know that there is a drop-off rate of more than 80% in the last step of the funnel is to have the above report. Slicing and dicing customer behaviour based on multiple data points is key to performing such analyses. Even the most basic Adobe Analytics package offers hundreds of data points. Those can be customised to help you build such reports. Then you can focus your energy and time where it matters the most: taking action!

Useful visualisations

If you want to view in which order visitors navigate through different sections, a flow visualisation will prove useful.

workspace flow report
Flow report from adobe.com

On the other hand, if your primary consideration is to compare between key segments you also have the flexibility to do so.

workspace tabular report
Segment comparison from adobe.com

Adobe Workspace is the key solution for interactive and exploratory analysis and is built with flexibility in mind. Combining multiple reports, graphs and data points is made really easy.

workspace sample dashboard enhanced
Adobe Workspace visualisation from adobe.com

How insightful the above visualisations become depends on your appetite to be data-driven. It is up to you to increase the number of data points you make available in Analytics.

The number of options to view, share and analyse the said data points is also large. It can enable both you and other stakeholders in your business to understand your customer from multiple perspectives. The best thing is that it removes a huge burden from your BI and IT teams to obtain, store and analyse this data. The process is much more streamlined and hides all the data engineering complexities so that decisions can be made faster.

What about Google Analytics?

How is this different from Google Analytics (or other web analytics solutions), you might ask. The main thing to understand is that the two products originate from different companies with different business models and backgrounds. The key difference is their target audience. Google Analytics is probably more suited to SMEs, while Adobe Analytics is best for big companies. Adobe offers a lot more corporate-related functionality and has far fewer limitations in terms of analytical features. This does not mean that big corporations cannot extract value from the product Google offers. In a lot of cases this also depends on the skills, maturity and requirements of your Digital Analytics team and their stakeholders. A proper comparison would need a series of posts, so I will not touch it at this time!

Summary

Hopefully this is a good first flavour of what Adobe Analytics is. If you have further questions, need support or would like to share an idea, feel free to get in touch.

Adobe Analytics Clickstream data with dbt & SQL – Guide

I have previously elaborated on both the usefulness of working with raw clickstream data and how powerful I consider dbt (data build tool) to be in managing SQL models. This time I will take a more practical approach and go through the basic setup to build some initial SQL models in BigQuery based on the Adobe Analytics Clickstream data. The topics to cover are:

  • Loading Adobe Analytics data
  • Setup
  • Compiling SQL files
  • Running dbt
  • Basic Metrics

Loading Adobe Analytics data

Before we get started, you need to have Clickstream data loaded in a BigQuery dataset. If you have not already done so, you can find more information about the topic at Load Data Feeds to Google BigQuery to understand the key steps and configurations of the process.

Furthermore, if you are not a Project Admin in the GCP project, make sure you have the following permissions (they are needed to execute dbt):

  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User

Setup

First you need to install the library; you have a few options available (https://docs.getdbt.com/docs/installation). For the remainder of this guide, I have used pip as the installation method on macOS. After you have installed it, clone the repository from https://github.com/konosp/adobe-clickstream-dbt and navigate into adobe-clickstream-dbt. The path models/introduction contains 5 SQL files; those are dynamic files based on Jinja – a templating language.

Let’s examine the first one: clickstream_full.sql. You will notice it looks like a regular SQL file that queries a typical Data Feed file from Adobe Analytics. The key point is that anything between double curly brackets gets parsed by Jinja. config captures model-specific configurations, i.e. the type of materialisation (view/table/ephemeral). var references dynamic variables (see below) and ref references other models. References help dbt assess the model hierarchy and the appropriate order of execution.

sample model
Sample model
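
Since the screenshot might be hard to read, here is a rough sketch of the shape such a model takes. It is only an illustration of the Jinja constructs mentioned above and may not match the repository file exactly; the variable names come from the dbt_project.yml configuration discussed further below:

{{ config(materialized='ephemeral') }}

SELECT *
FROM `{{ var('dataset') }}.{{ var('click_stream_table') }}`
WHERE date_time > '{{ var('start_date') }}'
  AND date_time < '{{ var('end_date') }}'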

Compiling SQL files

A similar structure exists in the rest of the models. Now execute dbt compile in the terminal. If the output is similar to the below, the library installation completed successfully and the models are discoverable.

dbt compile output
dbt compile output

However this does not mean the queries have been executed on BigQuery. dbt has just compiled our parametrised models into executable SQL files. The compiled files are under target/compiled/core/introduction. If you inspect any of the files, you will notice that their syntax is valid SQL.

compiled sql
Compiled SQL

But where are those variables located? The answer is in the dbt_project.yml file. The repository already contains this file. However, when you initiate a new project through dbt init (not needed if you clone the repository), this file is created automatically with a basic sample structure. The configuration shown below instructs all the models placed under the introduction folder to:

  • Be materialised as tables (unless different config is set within the SQL file).
  • Have the variables dataset, click_stream_table, start_date and end_date available, set to the values shown.
dbt_project.yml
dbt_project.yml

There are multiple ways to organise the settings and a read through the documentation would give you a much better idea of your options.

Before you proceed with executing SQL in BigQuery, you will need to set up your profiles.yml file (https://docs.getdbt.com/docs/configure-your-profile#section-getting-started). It handles the connection settings with BigQuery; the repository contains a sample file (profile-sample.yml). You will need to copy the file to the correct path – on macOS that is ~/.dbt/profiles.yml.

profile-sample.yml
Profile sample

Running dbt

How do you actually run the models? Easy: dbt run! Of course there is a wide list of options to use, but for a basic demonstration the above will suffice. If everything executes smoothly (authentication, datasets in place, permissions, valid Standard SQL etc.), you should see a similar output.

dbt run output
dbt run output

Note: In order for the execution to work, the profiles.yml file is configured to use oauth authentication. Make sure you have the Google Cloud SDK installed.

All models have executed, the tables are built and you have a nice overview of the results. Note: the dataset configured in dbt_project.yml needs to be present in your GCP project.

bigquery tables
BigQuery Tables

Only 4 tables were created. The clickstream_full table is missing due to the configuration {{ config(materialized='ephemeral') }} within the clickstream_full.sql file, which overrides any setting in dbt_project.yml. This means each model that references clickstream_full will inline its SQL as a common table expression instead of reading from a physical table. So no table/view is created for it.

Ephemeral Materialisation
Ephemeral Materialisation

Basic Metrics

Now that everything works, let’s go through the process of calculating some basic metrics. First we need to filter out all hits that should be excluded from reporting (e.g. bots, IP filtering, truncated hits) or originate from non-web hits (e.g. data sources). To do so, within clickstream_clean.sql we set the following WHERE clause (documentation):

  • exclude_hit = '0' AND hit_source NOT IN ('5', '7', '8', '9')
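
Putting it together, a minimal sketch of what clickstream_clean.sql could look like is shown below; the actual file in the repository may differ:

SELECT *
FROM {{ ref('clickstream_full') }}
WHERE exclude_hit = '0'
  AND hit_source NOT IN ('5', '7', '8', '9')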

Next we calculate two different daily metrics, page views and visits (I defined them separately on purpose, explanation below). Each file references clickstream_clean using the ref function.

daily page views model
Daily Page Views model
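
For reference, the daily page views model could be sketched roughly as below. The filter on post_page_event assumes that standard page views are recorded with a value of '0'; adjust it to your own implementation:

SELECT
   DATE(date_time) AS activity_date,
   COUNT(*) AS page_views   -- count only page view hits (assumption: post_page_event = '0')
FROM {{ ref('clickstream_clean') }}
WHERE post_page_event = '0'
GROUP BY activity_date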

The repository contains the exact files for both metrics. However remember that their date range depends on the values (start_date & end_date) configured in dbt_project.yml.

The best way to understand different dependencies is through the documentation generated by dbt. Now it is obvious that clickstream_clean depends on the root model clickstream_full (ephemeral model). Also its children are daily_page_views and daily_visits. If the clean dataset definition is modified (i.e. the SQL is updated), all its children will have to be updated (hint: this is managed magnificently by dbt!).

model graph
Model Graph – it derives from the usage of the ref() function.

This is just an idea of how you can structure your data workspace and start working with clickstream data in a much better fashion.

Epilogue

Utilising SQL models with dbt makes the life of an analyst working through Adobe Analytics Clickstream data much easier and more organised. There is huge potential and room for improvement in our daily analysis by taking advantage of such tools!

In the future I will be enhancing the models with more complicated and advanced cases! Let me know if you have any comments or thoughts!

PS: All code samples are at https://github.com/konosp/adobe-clickstream-dbt

Model Adobe Analytics Data in BigQuery with dbt – Intro

Working with the raw Adobe Analytics data is always both interesting and challenging. One of the reasons is that Adobe (and all analytics vendors) have trained us to have easy access to multiple reports and metrics. However, those rely on an enormous number of hours spent by some very intelligent people to build these products. So when we try to replicate a lot of those reports using SQL, it is only natural to – sometimes! – get discouraged. Modelling such a data set and building a manageable environment for analysis and reporting outside of the Adobe suite can be tricky. Is there any way to model Adobe Analytics data in BigQuery?

When I discovered dbt (https://www.getdbt.com) I was very excited. To get an initial idea:

dbt (data build tool) is a command line tool that enables data analysts and engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.

dbt helps do the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

From dbt documentation

The above is really an understatement of what it offers:

  • Organise all your models’ logic in small SELECT statements along with their related dependencies
  • Deliver production workflows for incremental data updates based on simple SQL
  • Version control and share your work
  • Implement testing to ensure validity and integrity of your work
  • A great documentation repository

The basics of dbt

Explaining what dbt is might be harder than actually demonstrating it. To put it in words, you can define all your business logic in simple SELECT statements. Each statement goes in a separate SQL file for greater modularity.

data models sql
List of SQL models

However, each statement might (or rather, definitely will) depend on other models. So within each SQL file you specify the dependencies. Usually, if you want to write a SELECT statement, you specify the source of your data in the FROM clause. In dbt, you do the same but reference the model’s file name instead.

single model definition
Simple model reference
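
For illustration, a hypothetical model that selects from another model named orders_clean (rather than from a physical table) would look like this:

-- orders_summary.sql (hypothetical model and column names)
SELECT
   customer_id,
   COUNT(*) AS orders
FROM {{ ref('orders_clean') }}
GROUP BY customer_id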

This helps dbt understand the dependency tree and build tables in the appropriate order. So no matter how complicated your logic is, good governance can be enforced. After you have coded your business logic in different models, dbt takes care of the execution. At the same time, it manages and notifies you about errors.

model execution
Multiple models execution

As soon as you take care of them (or simply want to modify a SQL statement), you can re-run only the updated model.

single model execution
Single model execution

So far we have not specified where the models live. In this demo, I am using Google BigQuery. The above creates a table (or view) in BigQuery based on the model defined within the “multi_product_purchasers_top_10.sql” file.

visitor id mapping
Materialised BigQuery table

Getting your hands dirty with dbt

After we have a basic understanding of dbt, let’s see what else is possible.

  1. Multiple database/data warehouse options. Integrations with all major modern data warehouse solutions are offered: BigQuery, Redshift, Snowflake. Support for Postgres and Spark is also available.
  2. Different materialisation options (views, tables).
  3. Incremental updates. If your process runs daily, only new/filtered data can be processed and appended to your tables (see the sketch after this list).
  4. Production versus Development execution. Distinction between development and production environments helps you deploy only high quality models in production reports while you still work locally on experimental tables.
  5. Testing models. Automated QA process on the different tables’ attributes (uniqueness of values, non-null etc.).
  6. Programming-oriented features (i.e. loops, if statements, macros). Create intelligent and re-usable logic!
  7. Full documentation and knowledge governance features – see below.
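
As an example of point 3, an incremental model in dbt typically wraps its filter in the is_incremental() helper. The sketch below is only illustrative; the model and column names are assumptions:

{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('base_hits') }}
{% if is_incremental() %}
-- on incremental runs, only process rows newer than what already exists in the target table
WHERE date_time > (SELECT MAX(date_time) FROM {{ this }})
{% endif %}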

More than a data build tool

One of the biggest issues in complicated data environments is maintaining accurate and clear documentation. Helping data analysts understand where each table derives from and what kind of columns exist, but also letting them search for available tables based on keywords, is a huge advantage. How many times were you reviewing a table with no idea what the columns meant or where the data came from?

With the native documentation functionality of dbt, you can easily manage the above issues. The best aspect however is using version control to enforce that only tables with full documentation are pushed to production. For example, below is the graph for the models previously described. Understanding relationships was never easier!

model graph
Model graph

Furthermore, searching through available tables and/or columns is fast and easy!

model documentation search
Search

Finally, as part of the code, table descriptions can be compiled and be part of the project documentation.

model documentation
Documentation

Starting with Adobe Analytics

So let’s see how the knowledge above can be applied to Adobe Analytics data. We have already demonstrated how to import and process Adobe Analytics data in BigQuery. If you missed those posts, check https://analyticsmayhem.com/tag/bigquery/.

One of the most common problems is proper visitor identification. Usually our visitors log in mid-way through their journey and we store their customer ID in a traffic or conversion variable. With the available configuration, data cannot be backfilled to attribute success metrics that occurred prior to the authentication of the customer.

Let’s see how this can be solved with dbt.

First we need to define a visitor mapping where each customer ID is associated with a visitor ID. This is our first model.

visitor id mapping
Visitor ID mapping

Next we need to update all records with the customer ID (including those that occurred before the login).

visitor id backfilling
Joining Visitor ID mapping with all records
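
As a rough sketch (model and column names are illustrative, and it assumes the customer ID is captured in eVar1), the two models could look like this:

-- visitor_mapping.sql: one customer ID per visitor
SELECT
   post_visid_high,
   post_visid_low,
   MAX(post_evar1) AS customer_id   -- assumption: customer ID stored in eVar1
FROM {{ ref('clean_hits') }}
WHERE post_evar1 IS NOT NULL
GROUP BY post_visid_high, post_visid_low

-- hits_backfilled.sql: attach the customer ID to every hit of that visitor,
-- including hits that happened before the login
SELECT
   hits.*,
   mapping.customer_id
FROM {{ ref('clean_hits') }} AS hits
LEFT JOIN {{ ref('visitor_mapping') }} AS mapping
ON hits.post_visid_high = mapping.post_visid_high
   AND hits.post_visid_low = mapping.post_visid_low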

Let’s create another table to assess the impact of the data backfilling.

monitor model
Summary metrics model

When comparing the ratio of “identified” hits/rows between the original data set and the backfilled one, we observe an improvement in the percentage of hits that are associated with a customer ID. This will depend on the nature of your data, but the main idea is that you can then set up an ongoing process to keep backfilling the data.

results
Summary metrics results

Epilogue

Hope you are as excited as I am about all the capabilities that BigQuery along with dbt can offer your Analytics team! If you are interested in more Adobe Analytics clickstream posts, please look at https://analyticsmayhem.com/tag/datafeeds/. Otherwise, if you are interested in support and assistance, happy to discuss more!

Adobe Analytics and time between funnel steps

Recently I had to deal with the question “How much time does it take for a customer to go through the funnel steps?” A simple question on the face of it. But the more I was thinking about it, the more complicated it was becoming. So what is the best way to measure and report, using Adobe Analytics, the time required for your visitors to go through different funnel steps?

The main requirement that I had to meet was something along the lines below:

Requirements
Requirements

I am going to demonstrate a few alternative approaches to answering this question. Depending on your resources, setup and requirements you have a few options to consider:

  1. Adobe Analytics out-of-the-box report
  2. Adobe Analytics plugin
  3. Data collection customised solution
  4. Adobe Data Feeds and SQL

1) Adobe Analytics Out of the box report

The easiest way to answer such a question would be to use the Adobe Analytics default report “Time prior to event”. It calculates the amount of time required for a success event to occur within a visit. However, the report is not really practical because the timing values are returned as dimension values (and grouped in buckets). This is not really useful if you want a trended analysis, i.e. to monitor the over-time performance of customers performing a certain action. If that were possible, this would be the best combination of ease of implementation and utility.

Time Prior to Event report
Time Prior to Event sample

2) Adobe Analytics plugin

There is a plugin to evaluate the difference between two success events: “getTimeToComplete“. On the surface it looks like a nice solution. However, it is hard to ignore some key limitations:

  1. It calculates the difference between two events only. You have to code separate logic for each pair of events, e.g. one block for “Success Event 12” – “Success Event 10” and a different one for “Success Event 40” – “Success Event 31”.
  2. It suggests storing the value in a dimension, which is good for classification but not so good for trended analysis.

There is always the option to modify the plugin and customise it, but if you are willing to go down that path, please read below!

3) Data collection custom solution

This is where things can get complicated! The more I was thinking about the process, the more I was coming up with caveats that needed to be addressed. The initial success events and prop configured would be:

  • 1 success event will capture the time difference in seconds from the “Start” of the journey; numeric event
  • 1 success event will capture the total number of instances a measurement was recorded
  • 1 prop will capture the current step i.e. “step 1”, “step 2”, “end”. This will be utilised later on for hit-based segmentation to isolate the measurements for each individual step

Now let’s discuss the considerations:

  • Each measurement would need to be captured once per logical time frame (or session). This is required to avoid double-counting cases where users go through certain steps of the process multiple times.
  • If the user comes back to the site after that logical time frame, we would like to capture the measurements.
  • We need to define explicitly all the steps of the process in order to adjust the data collection accordingly. That has the inherent limitations of no historical data and low flexibility.

In order to deal with that, I had to use a concept called journey. The journey has an explicit starting point and fixed steps. We need to know those in order to calculate the various time differences. A journey:

  1. Will start once and has a certain duration.
  2. During that time period, it cannot restart.
  3. During the same time period, each measurement will be captured only once.
  4. Every time the user goes through the steps, the journey is being checked if it has expired. If yes, then it is eligible to be restarted and all subsequent steps to be measured.

As you can see, there are already a lot of constraints and limitations. But let’s see how this looks visually. The below highlights the basic steps that a user would have to go through. Once the “Registration Start” step is triggered, the journey is initiated and the expiration is set to 30 minutes (or any time frame that works for you) in the future. Then each step is recorded only once during that journey.

Journey
Journey

In order to achieve that, the below code was required. To be honest, the measurements were surprisingly good, but as you can see from the code, the implementation is far from straightforward. All the logic, considerations and constraints had to be implemented. Have a look:

But how did the results look? Is this something that can be used? Actually it can, with some limitations! Since all the measurements are captured using 2 success events and a prop, we need some segmentation to get the numbers. In particular, to calculate the time required to get from the “start” of the journey to each step, we need to do as follows. First we calculate the average time since the start of registration.

Calculated metric
Calculating the average time since the journey start

After that, we segment at hit level using the prop previously mentioned. This isolates the measurements for the time required to reach each step.

Reporting configuration of metrics
Reporting configuration

The above reporting configuration can lead to a visualisation like below.

Trended reporting of time measurements
Trended view of “Time Since Start” for different steps.

Unfortunately the above calculations are based on the mean value. That means they are very sensitive to extremely high measurements, and there are a lot of those. In summary:

  1. No historical data
  2. Measurements heavily depend on data collection
  3. No flexibility for custom funnels
  4. High sensitivity to high values

The initial approach could provide a good basis for monitoring the time differences between certain actions of interest. However, all the work above seems like a bit too much. Let’s explore an alternative!

4) Adobe Analytics Data Feeds and SQL

All the methods mentioned above feel limiting compared to my preferred option: working with Adobe Data Feeds using SQL! I have been arguing in favour of working with Data Feeds for some time now. They are very flexible and will open up a whole new world to explore. In order to find out more details about loading data and working in BigQuery, please visit my previous articles on it:

Let’s start with the ideal result first – what we really need! The table below provides indicative measurements where we compare the mean and median time required to reach each registration step from the first step. Additionally, we can see the median values are much lower, highlighting a previously mentioned limitation: average values are highly sensitive to extreme measurements. PS: the numbers are not real, I fiddled with them; nevertheless the principle still applies.

Funnel Step              Mean    Median
registration: step 2     852     494
registration: step 3     1346    935
registration: step 4     1549    1080
end                      2101    1380

But how do we get there? We will break down some key steps. The GitHub Gist below is organised in 4 separate SQL queries, each one meant to be executed on the back of the results of the previous one.

Let’s analyse them one by one.

  • 1_filterRows.sql: This depends on your implementation, but the key idea is that we want to isolate the rows where each individual step is recorded. You might need to filter based on a different prop, eVar or event; however, you always need the visitor ID, visit number, step name and timestamp.
  • 2_calculateStepStart.sql: Group by visitor ID and visit number in order to obtain the earliest time at which each step of the journey took place.
  • 3_calculateTimeDifferences.sql: Calculate the time difference of each individual step from the earliest measurement per visit & visitor ID (see the combined sketch after this list).
  • 4_calculateMeasurements.sql: Calculate the median and mean values per funnel step.
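
Since the Gist itself is not embedded here, below is a minimal sketch that combines the ideas of queries 2–4 into a single statement. It assumes the filtered rows from the first query are available in a table called filtered_rows with the columns post_visid_high, post_visid_low, visit_num, step_name and a TIMESTAMP column date_time (all names are illustrative):

WITH step_start AS (
  SELECT
     post_visid_high,
     post_visid_low,
     visit_num,
     step_name,
     MIN(date_time) AS step_start_time   -- earliest occurrence of each step per visit
  FROM filtered_rows
  GROUP BY post_visid_high, post_visid_low, visit_num, step_name
),
time_differences AS (
  SELECT
     step_name,
     TIMESTAMP_DIFF(
        step_start_time,
        MIN(step_start_time) OVER (PARTITION BY post_visid_high, post_visid_low, visit_num),
        SECOND) AS seconds_since_start   -- time from the first step of the visit
  FROM step_start
)
SELECT
   step_name,
   AVG(seconds_since_start) AS mean_seconds,
   APPROX_QUANTILES(seconds_since_start, 2)[OFFSET(1)] AS median_seconds
FROM time_differences
GROUP BY step_name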

Hope you have found the overview helpful! Given you have a process in place to load Adobe Analytics Data Feeds into a database, you are able to generate the same results much more easily.

As I have reiterated many times, working with the raw clickstream data is the way forward for analysts! Have fun!

Advanced cases of Adobe Clickstream with BigQuery

In the previous post I provided a few basic examples of how to start your journey analysing Adobe Analytics Clickstream Data Feeds in BigQuery. This time I am going to complete the remaining points of the agenda. Overall we have:

  1. Hit-based segments (part 1)
  2. Visit-based segments (part 1)
  3. Visitor-based segments (part 1)
  4. Sequential segments (part 2)
  5. Success Events (part 2)
  6. Products and merchandising eVars (part 2)

You can review points 1-3 in the previous part in case you missed it. Let’s continue with the rest of it!

4. Sequential segments

In order to implement a sequential segment, we need to take the raw data and self-join the table on the commonly used IDs mentioned before (i.e. post_visid_high, post_visid_low, visit_num). However, we need to apply the time-based constraints, i.e. on step 1 the customer visited a product URL and on step 2 a checkout URL. This can be easily done using a time comparison on the date_time column. Eventually we can extract into a separate table the post_visid_high, post_visid_low and visit_num values that satisfy the constraints.

SELECT
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container
FROM
   `[dataset id].[table name]` AS step1, -- self-join
   `[dataset id].[table name]` AS step2
WHERE
   step1.date_time > '2018-10-01'
   AND step1.date_time < '2018-10-05'
   AND step2.date_time > '2018-10-01'
   AND step2.date_time < '2018-10-05'
   AND step1.date_time < step2.date_time -- Time-based Comparison
   AND hit_segment(step1.page_url,'contains','/product/')
   AND hit_segment(step2.page_url,'contains','/checkout/')
   AND step1.post_visid_high = step2.post_visid_high
   AND step1.post_visid_low = step2.post_visid_low
   AND step1.visit_num = step2.visit_num -- Visit container
GROUP BY
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container

A couple of notes:

  1. The date time conditions are needed because we are querying a time-partitioned table. If you opt out of such configuration (not recommended), then you can ignore them.
  2. A key comparison is "step1.date_time < step2.date_time". This enforces the logic that the customer first took action A and then action B.

The result is a set of visitor IDs/visit numbers which meet the segment criteria. We can save it in an intermediate table and then join it with the raw data on all three columns. If we remove the column "visit_num", the result becomes a sequential visitor segment.

5. Success Events

Manipulating success events requires a bit more work. First you need to re-format the data, and you also need to know what each number represents. Let's have a first look at success events:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The columns post_visid_high/low, visit_num are familiar but the event_list is a comma-delimited list of numbers:

This is not practical for two reasons: you cannot really count the number of times each event occurs or perform any analytical operation, and you cannot tell what the above numbers mean (185 anyone?) unless you know them by heart.

Extracts from Adobe Data Feeds come with a convenient mapping file (event_list.tsv) that maps the numbers to something more meaningful, especially for out-of-the-box events. The neat thing is that the event_list column also contains instances of eVarXX. This is handy if you want to analyse hits where an instance of an eVar is set, or just count/compare instances (for this you need to use the post version of the column, post_event_list).

Having said the above, two key BigQuery features are needed to properly process the events:

  • SPLIT function: Since the event_list is a comma-delimited string value, we can split the values as a first step so we can have each individual event in a separate cell. The syntax of the function is straight-forward:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   split(event_list,',') as splitted_events_list
FROM `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The outcome is that the initial row is converted into an array:

However this is still not enough. The "splitted_events_list" column above technically contains in each cell an array. So in rows 1 & 2 event 188 is in cell 2 while in rows 3 & 4 it is in cell 3. The more complicated the data gets, the harder it is to analyse it. That leads us to the next point!

  • UNNEST function: After we have generated the array, we need to flatten it. At this stage BigQuery cross-joins each row with each element of the array in the splitted_events_list column that corresponds to that row:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   events_formated
FROM
   `[dataset id].[table name]`, 
   unnest(split(event_list,',')) as events_formated
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

So from the previous table, we end up having one entry per post_visid_high/low and value in the event_list (i.e. now we have a full row for event 188 for all cases that the event appears):

To make things easier we can join with the "event" lookup file (its key and value fields were used as column headers during import):

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   events_formated,
   mapping.value 
FROM
   `[dataset id].[table name]` as hits, 
   unnest(split(event_list,',')) as events_formated
JOIN 
   `[dataset id].event` as mapping
ON 
   events_formated = mapping.key 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

Now it is much easier to filter and work on individual events. Following the same logic we used in segments, you can generate visitor or visit IDs that have a specific event or volume of events:

Bonus point: As you can observe above, not everything is as clean as it should be. Who knows what Custom Event 7 means?! Apart from the Analytics team, nobody! You can create an external table maintained in Google Drive that holds descriptive values and join it with your results, thus providing a more user-friendly view.

6. Products and merchandising eVars

Admittedly, the products variable is the most complicated one. In order to provide a powerful eCommerce reporting capability, Adobe had to come up with this complicated variable:

s.products = 
"Category 1;Product 1;Quantity 1;Price 1;events 1;Merch eVars 1
[,Category 2;Product 2;Quantity 2;Price 2;events 2;Merch eVars 2]
[,...]";

Not very pretty. We can start with the same principle, splitting the variable into individual product entries. Multiple product entries are separated by a comma:

SELECT
   product_list,
   split(product_list,',') as splitted_product_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

As a result, each product ID and the associated events, evars etc. populate different cells within the array result. This gives us a better view of what data each product entry contains:

This is not good enough. We need to dig deeper into the structure. For that reason we will flatten the results using UNNEST. Then we will split again the "splitted_product_list" but this time using the semicolon as a separator. We can convert the result into different columns utilising the offset function which returns a specific entry within an array. Because within the individual product entries the product ID, events and evars are always in the same spot (2nd, 5th and 6th respectively), it is very easy to extract the values:

SELECT
   post_visid_high,
   split(splitted_product_list,';')[offset(1)] as product_id,
   split(splitted_product_list,';')[offset(4)] as events,
   split(splitted_product_list,';')[offset(5)] as evars
FROM
   `[dataset id].[table name]` , 
   unnest(split(product_list,',')) as splitted_product_list
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

So the less than practical table from before can be transformed into the below:

This however allows us to work only with the product_id column. Events and eVars still need further splitting/flattening if we want to do more than a high-level analysis. But following the same methodology (split and flatten) you can get to the desired results very easily.

Summary

This is by no means a comprehensive list of queries that you can perform on Adobe Analytics Data Feeds, but I hope it provides you with a good starting point!

A few closing notes:

  1. Make sure you understand the differences between post and pre columns. The differences in their values depend on your implementation and report suite configuration.
  2. Make sure to exclude from your calculations hits that are discarded by Adobe. Data feeds contain them as well.
  3. Understand how the visitor identification process works.
  4. The SQL used above is written for convenience rather than performance optimisation, but if you notice room for improvement, please let me know.

Start playing with Data Feeds! Let me know if you have any comments!

Adobe Analytics Clickstream & BigQuery introduction

In the previous post I tried to elaborate on the main steps involved in loading Adobe Analytics clickstream data from Data Feeds into BigQuery. If you missed it and you are not sure how to start experimenting yourself, have a look at it!

This time, in this 2-part post, I will focus on some practical examples of how to approach common cases and create, the SQL way, the logic that we usually implement through segments.

Particularly:

  1. Hit-based segments
  2. Visit-based segments
  3. Visitor-based segments
  4. Sequential segments
  5. Success Events
  6. Products and merchandising eVars

Before you actually start experimenting with feeds, the first thought that will come to your mind is “I can do this in 30 seconds within Adobe Workspace, why should I bother with a more complicated method?” The answer is not just about getting numbers out. If you want to succeed long term as an analyst, first you need to be able to work outside of your comfort zone (i.e. Adobe Analytics). There will always be a time when you need to extract data straight from a DBMS. Also, your business might require you to combine Web Analytics data with other 1st party data within a Data Warehouse system; the following examples can be applied outside of BigQuery. Additionally, the more you understand how Adobe Analytics works (or any other web analytics solution), the more confidence you will have in your analysis.

Finally, and this is more of a personal thought, I do believe that in the future, as web analytics vendors and companies become more mature, web analytics will converge with and be absorbed by the overarching Analytics & Data functions. So if you want to step up the ladder, you need to embrace all aspects of the data & analytics tech stack available to you.

Let’s get started!

1. Hit segments

This is the simplest scenario. Since we are already working with raw click-stream hits, all we have to do is filter the rows. For convenience we create a UDF (user-defined function) that applies the segment logic.

CREATE TEMP FUNCTION
  hit_segment(variable STRING,
    match_type STRING,
    value STRING) AS (
    CASE
      WHEN match_type = 'contains' 
           THEN variable LIKE CONCAT('%',value,'%')
      WHEN match_type = 'equals' 
           THEN variable = value END);

The above implements the logic applied in simple hit segments, i.e. parameter X contains or equals value Y. Based on the input parameters (variable name, match type – equals or contains – and expected value) it returns True or False, which can then be used in the WHERE clause of your query. You can expand it to use more complicated logic. The UDF is invoked as per below:

SELECT
   pagename
FROM
   `[dataset name].[table name]` 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND hit_segment(page_url,'contains','/product')

Straightforward! One thing to keep in mind is that temporary UDFs need to be defined in the same query as your base query. Otherwise the execution will fail.
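
As an example of the “more complicated logic” mentioned earlier, the UDF could be extended with an extra match type; a possible sketch using BigQuery's STARTS_WITH function:

CREATE TEMP FUNCTION
  hit_segment(variable STRING,
    match_type STRING,
    value STRING) AS (
    CASE
      WHEN match_type = 'contains' 
           THEN variable LIKE CONCAT('%',value,'%')
      WHEN match_type = 'starts with' 
           THEN STARTS_WITH(variable, value)
      WHEN match_type = 'equals' 
           THEN variable = value END);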

2. Visitor segments

I have decided to tackle visitor segments first, to introduce the logic, and then expand it for visit-based segments. This is a two-step process:

  1. Identify the visitors that satisfy the segment criteria and extract their IDs. Adobe recommends using post_visid_high and post_visid_low to uniquely identify visitors. We will store the results in an intermediate table (segment table).
  2. Then we join the segment table with the original data set and return only the hits of those particular visitors.

For example, let's extract all the visitors that viewed a product detail page. We are re-using the temporary function "hit_segment" from earlier. Also, we need to specify a date range since we are using time-partitioned data. This provides a degree of flexibility because we can select visitors that performed an action in, say, January but view their activity in a different time period.

SELECT
   post_visid_high,
   post_visid_low
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-10-01' 
   AND date_time < '2018-10-30' 
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low

The result is a list of values where each row corresponds to an individual visitor that viewed a product page at some point in October 2018.

We save the result above in a secondary table named "visitor_segment" and then join the original data set with the "visitor_segment" table:

SELECT
   data.va_closer_id , 
   -- counting distinct visitor IDs so repeat hits are not double-counted
   count(distinct concat(seg.post_visid_high, '-', seg.post_visid_low)) as unique_visitors
FROM
   `[dataset id].[table name]` as data
JOIN 
   visitor_segment as seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
   AND data.date_time > '2018-12-01' 
   AND data.date_time < '2019-01-31'
GROUP BY
   data.va_closer_id -- required so the count is produced per marketing channel

The 2-step process we just followed calculates the number of unique visitors per marketing channel, over the period Dec-2018 to Jan-2019, for visitors that had viewed a product page in Oct-2018.

In general, I find it preferable to save segment results in tables rather than use saved queries, temporary tables or nested queries, for the following reasons:

  1. Higher reusability/reproducibility of analyses. Once the outcome of a segment is considered valid, then it is easier to share and document the table rather than the underlying query. 
  2. Referencing the segment table in other queries makes the code much cleaner than introducing multiple nested statements.
  3. Temporary tables and views do not get materialised and you are getting billed every time you query them.

For the reasons mentioned above, the same logic (saving segments as tables) will be assumed for the rest of the post.

3. Visit segments

At the moment there is no single column that provides a unique visit ID. However, we can overcome this by utilising the column "visit_num". It corresponds to the Visit Number dimension within Adobe Reports. It starts at 1 and increments each time a visitor starts a new visit. So, in combination with post_visid_high/post_visid_low, it yields a unique visit identifier for each visitor.

We follow the same approach to create a table named "visit_segment" by introducing the "visit_num" column. It will contain all visits that have at least one hit on a product page during December 2018:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-12-01' 
   AND date_time < '2018-12-31'  
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low,
   visit_num

Now we can return all the visits that satisfy the segment requirements. Again we take the same query as before, but this time we add visit_num as an additional join condition.

SELECT
   data.va_closer_id as marketing_channel, 
   -- counting distinct visitor IDs so repeat hits are not double-counted
   count(distinct concat(seg.post_visid_high, '-', seg.post_visid_low)) as unique_visitors
FROM
   `[dataset id].[table name]` as data
JOIN 
   visit_segment as seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
   AND data.visit_num = seg.visit_num
WHERE data.date_time > '2018-12-01' 
   AND data.date_time < '2018-12-31'
GROUP BY
   data.va_closer_id -- required so the count is produced per marketing channel

Note: the date ranges now need to match between the two queries otherwise you will get no results! This is because we are working on a visit segment (and use the visit_num as the join condition).

At the end you will get a view similar to this one. The marketing channels are represented as numbers, and what each number means depends on each report suite's configuration. A lookup for these IDs can be found in the Marketing Channel Manager section.

Summary

Going through the data and getting familiar with it is much easier than it looks. One of the best exercises I would suggest is to try to replicate basic reports through BigQuery and match them with the same reports within Adobe Analytics (i.e. daily page views or daily exit links). This way you will also familiarise yourself with other data peculiarities; for example, when hits should be excluded from any calculations (hint).

Hope you get an idea of how easy it is to start analysing raw click stream data.

Load Data Feeds to Google BigQuery

Adobe Analytics reporting solutions (i.e. Reports & Analytics, Analysis Workspace, Report Builder or Data Warehouse) offer a wide range of options to view, understand and analyse your data. Admittedly, this covers the majority of most people’s data needs. However, there comes a time when you need something more. You need the extra granular detail that otherwise is too complicated to get to, or you just need a different approach to analysing your corporate data. There are a number of reasons and those really depend on the particular use case. But even if there is no specific case and it is for exploration’s sake, Adobe Data Feeds is a really useful data extraction solution from Adobe.

However their use can be a little intimidating for first time users. This can be due to a number of reasons:

  1. The format of the data; each row represents an individual event taking place on the site instead of the traditional dimension/metric approach.
  2. Too many technical columns not really relevant or even known to most people.
  3. Availability of columns that contain data before and after additional data processing steps.

Combining the above with a potentially huge volume of traffic can easily lead to unmanageable situations, making data exploration and analysis almost unfeasible (remember: each page view generates a row in the final extract). Most people work on laptops; does everybody have a laptop that can handle 20-30 GB worth of data in-memory without significantly limiting other functionalities? What about constant queries against a local database? Would you be patient enough for your queries to take 15 minutes to return some basic results? In those cases a good cloud-based data processing and warehousing solution really helps. It gives the Analytics team the flexibility to perform custom analysis, run ETL jobs on demand or on a schedule, and scale up or down as they see fit within their own “Analytics Lab”; true flexibility that empowers innovation!

This leads to this walkthrough: the key steps required to set up an ongoing data pipeline from Adobe Analytics to Google BigQuery.

Let’s get started!

Pre-requisites

The examples shown below will be in Python. Furthermore, you will need a Google Cloud Platform account with billing enabled and at minimum owner-level access to Storage, Compute and BigQuery. For first-time users there is a trial version with a few hundred pounds available to spend any way you like. This is ideal for experimentation.

Data Feed configuration & FTP setup

Once you get to the Data Feed configuration, you will notice two sections: general delivery settings and the columns needed for the extract.

General delivery includes the report suite(s), feed interval, date range and delivery option. The simplest option is FTP. If you do not have an FTP server available, Adobe offers this service free of charge. If you use an Adobe-provided FTP server, keep in mind that there is an upper limit of 50 MB. If this is exceeded (depending on the traffic of your report suite and the columns selected), the delivery will fail and you can lose data. In this case you might want to go with an hourly feed interval so data is split into smaller parts, avoiding the said risk.

In the columns section, you need to specify which fields you want the feed to include. There is a huge list of options ranging from standard technology and geolocation attributes to all the custom eVars, props and events. The documentation also elaborates on the concept of the “post_” columns versus the regular ones. This will require some additional investment from your end to understand what works best for you. If you are not certain, you can always select both versions of a variable (i.e. both evar1 and post_evar1) and query the appropriate one within BigQuery. One drawback of data feeds is that if you want to adjust the feed by adding/removing columns and apply this change to historical data, you need to re-configure and re-run the feed.

Note: Adobe natively supports delivery to Microsoft Azure or Amazon S3. In theory you can send the files to AWS and then pull them into Google Cloud Storage using a transfer job, but this implies maintaining active accounts with, and paying, two cloud providers. This is not my personal favourite, and the explored approach is restricted to GCP-only resources.

Key settings:

  • Remove Escaped Characters: Do check this! It removes from the feeds escaped characters which cause problems down the line. For example, if your data collection process includes the tab character “\t”, this is automatically escaped as “\\t”. However, as the extracts are tab-delimited files, the BigQuery loading process is very easy to break. It is safer to remove them altogether and make your process more reliable.
  • Compression format: Gzip. Data loading into BigQuery is flexible but not with regards to compression. Out of the two options Adobe offers (zip or gzip), BigQuery supports only gzip. So stick with it. This way the .tsv files can be loaded directly without decompressing them, and they can be archived in Storage compressed as well. This can have a big impact on long-term billing.
  • Packaging type: Multiple files. This is a personal preference but I like multiple files. This way the only content that needs decompression is the lookup files. Those contain information like the clickstream headers (configured below) and look-ups for browser details, operating systems etc. The reason is that the actual hit-level data include only a mapping reference to those values but not the actual text. If you select Single File, then everything is compressed under a single .zip file.
  • Manifest: Manifest File. It is usually a good idea to include the manifest file. This way you can include an automated step to parse the file, extract the data files’ names and validate the MD5 checksum before pushing the files to the next step of the process. Depending on the Feed Interval (hourly or daily), it can contain multiple entries in each file.

The settings mentioned can be found in the middle section of the New Data Feed page, but sometimes it is easy not to pay attention to them:

Job scheduling

The extract is scheduled! Now the complicated part. You need to orchestrate the job. With some basic Linux skills, everything can be done within a small Virtual Machine.

Create a small VM; the machine type can even be the smallest one, f1-micro (1 vCPU, 0.6 GB memory). The process is not memory or CPU intensive by itself and the VM is for orchestration purposes. The cost is low, and if you use the same machine for scheduling additional jobs, the cost per job drops even further.

After the VM is initiated you will first need to make sure Python and pip are installed. Usually the VMs come with Python/pip pre-installed, but if for any reason they are not, you can get this done easily following these instructions. Once you are done, you can install all the Python requirements through pip (full list):

pip install -r requirements.txt

The actual scheduling operation can be done using cron jobs. In Linux you can execute scripts hourly if you place them under /etc/cron.hourly. The file needs to be executable and not have any file extension (i.e. it needs to be named “execute_pipeline”):

#!/bin/sh
cd /project-path/ &&
python main.py

Data transfer (1/2) – Download through FTP

In order to fetch the data from the FTP, the easiest way is to install pysftp through pip. pysftp provides an easy utility to download entire folders from a remote path:

import pysftp

# Create the connection
ftp_connection = pysftp.Connection(
    host = ftp_host , 
    username = ftp_username,
    password = ftp_password
)
      
# Copy the contents of "feeds_remote_path" into "feeds_local_folder"
ftp_connection.get_d(remotedir = feeds_remote_path, 
                     localdir = feeds_local_folder)

Data pre-processing (optional)

If you want to be extra cautious about the files you process, this is where the manifest files come into play. They provide a summary of the completed files along with a way to validate them. The most useful info is the name of the file and the MD5 checksum. You can parse the manifest files, extract file names and MD5 values and validate versus the actual files.

Datafeed-Manifest-Version: 1.0
	Lookup-Files: 1
	Data-Files: 1
	Total-Records: 611

	Lookup-File: bugzilla_2012-09-09-lookup_data.tar.gz
	MD5-Digest: af6de42d8b945d4ec1cf28360085308
	File-Size: 63750

	Data-File: 01-bugzilla_2012-09-09.tsv.gz
	MD5-Digest: 9c70bf783cb3d0095a4836904b72c991
	File-Size: 122534
	Record-Count: 611

Only then can a file continue to the next step of the process. The hashlib package offers a nice way to calculate the MD5.
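
A minimal parsing sketch in Python, assuming the manifest follows the key/value format shown above (file and function names are illustrative):

def parse_manifest(manifest_path):
    """Return (file name, MD5 digest) pairs for the data and lookup files listed in a manifest."""
    entries = []
    current_file = None
    with open(manifest_path) as f:
        for line in f:
            key, _, value = line.strip().partition(':')
            key, value = key.strip(), value.strip()
            if key in ('Data-File', 'Lookup-File'):
                current_file = value
            elif key == 'MD5-Digest' and current_file:
                entries.append((current_file, value))
                current_file = None
    return entries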

Note: Because we are parsing clickstream data, it might be the case that the files are in the range of 20-30 MB (or even more). You can parse each file in chunks and gradually calculate the overall MD5 checksum per file to make sure you do not run the Virtual Machine out of memory and crash it (we are using the smallest machine type after all, and GCP VMs do not come with swap space for some reason):

import hashlib

final_hash_md5 = ''

hash_md5 = hashlib.md5()
with open(file_name_with_path, 'rb') as f:
    for chunk in iter(lambda: f.read(4096), b''):
        hash_md5.update(chunk)

final_hash_md5 = hash_md5.hexdigest()

Data transfer (2/2) – Upload to Google Storage

Data files can be easily uploaded using the Cloud Storage Python SDK. In particular, you can iterate through all files and upload them one by one using the very convenient upload_from_filename method.

from google.cloud import storage
from google.cloud.storage import Blob

# Initiate GCP client and bucket client
client = storage.Client(project='')
bucket = client.get_bucket('')

# Create blob object
blob_name = "path/to/blob/file-1.tsv.gz"
blob = bucket.blob(blob_name)

# Perform upload of data from the local file path to the blob object on GCP
blob.upload_from_filename(local_file_name)

At random times I have noticed that the uploaded files cannot be properly loaded into BigQuery due to the presence of the null ASCII character (“Bad character (ASCII 0) encountered”). Fortunately, setting the content type properly resolves this issue.

blob.upload_from_filename(file_name, content_type='application/gzip')

You can retrospectively modify the content type of any given file. The utility tool gsutil that comes with the Google Cloud SDK provides a flexible set of commands. Simply run this in your local command line or through Google Cloud Shell:

gsutil -m setmeta -h "Content-Type: application/gzip" \
    gs://bucket name/path/to/blobs/*.tsv.gz 

Make sure the blob path name contains the file extension as well i.e. “path/to/blob/*.tsv.gz” instead of “path/to/blob/*”. This prevents the parsing of temporary files that might have been created during the data transfer process.

Data Loading into BigQuery

The generic instructions to load a CSV/TSV into a table are easy to follow and this is a great starting point. A few changes are required to make Adobe Data Feeds imports more reliable and less prone to break. The CSV options provide a range of configurations (some of them are not available in the BigQuery console):

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.allow_jagged_rows = True
job_config.ignore_unknown_values = True
job_config.field_delimiter = '\t'
job_config.null_marker = ''
job_config.write_disposition = 'WRITE_APPEND'
job_config.skip_leading_rows = 0
job_config.source_format = bigquery.SourceFormat.CSV

# Configure partitioning
partitioning = bigquery.table.TimePartitioning(field='date_time',
   require_partition_filter = True)

job_config.time_partitioning = partitioning


# The following are not available in the console but are important!
job_config.encoding = 'ISO-8859-1' 
job_config.quote_character = '^'

All options are added to a job configuration object which later on is used during the load operation. The most important are:

Partition: Using a time partition to limit your queries (and your costs) is a good long-term strategy. Thankfully Adobe provides a field (date_time) that is compatible with the format expected by BigQuery. Make sure you include this field when you configure your feed. The attribute “require_partition_filter” makes the usage of the field mandatory at querying time and therefore makes your users explicitly aware of it.

Encoding: Data Feeds are automatically encoded in ISO-8859-1 and there is no option to change it.

Quote Character: BigQuery by default uses the double quote (“) to identify quoted sections. In the case of Data Feeds, however, the double quote does not have any special meaning. Instead, tabs, new lines and backslashes have already been removed (see Table Special Characters – 1) through the “Remove Escaped Characters” option in the feed creation process, and the caret symbol (^) is used to escape a set of characters (see Table Special Characters – 2).

So why is this important? If your data collection process contains double quotes, they will not be escaped and the loading process will fail. My experimentation so far has shown that setting the quote character to the caret symbol always works (even though this might not be the case in all implementations).

Table Special Characters – 1 and 2: see the Adobe Data Feeds – Special Characters reference for the full lists.

Another aspect of the loading process is the table schema. The quick and easy way is to set all columns to STRING and cast each type to the most appropriate one (e.g. INT64) at querying time. The only exception to this is the ‘date_time’ column, which is used for partitioning; it needs to be set as ‘TIMESTAMP’. To get the correct column names you have two options:

  1. Manually download the headers from the Data Feeds console. This is a CSV file containing all headers, one per line. You can then store the file locally and reference it every time the loading process is executed.
  2. On every execution, extract and parse the file named “<report suite>_<date>_<hour>-lookup_data.tar.gz”. This tarball contains all the lookup values and the hit data headers. The tarfile Python module helps us extract them into a destination folder, as shown below.

import tarfile

tar_file = tarfile.open(lookup_data_tar)
tar_file.extractall(path = destination_folder)
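
As a rough sketch of option 2, the header names can then be read into a list directly from the extracted archive. The file name column_headers.tsv is an assumption based on typical Data Feed deliveries, so verify it against your own lookup archive; the file from option 1 can be read in a similar way, except it has one header per line.

import csv
import os

# Assumption: the extracted archive contains a tab-delimited "column_headers.tsv"
# with all hit data column names on a single line
headers_file = os.path.join(destination_folder, 'column_headers.tsv')

with open(headers_file) as f:
    headers = next(csv.reader(f, delimiter='\t'))

print('{} columns found'.format(len(headers)))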

No matter which method you follow, you need to make sure the column names meet the requirements set by BigQuery. The following uses a regular expression while iterating through all column headers (assuming you have already loaded them into a list) and keeps only valid characters. Eventually the schema is added to the job configuration. Additionally, the type of the field “date_time” is set to TIMESTAMP for partitioning purposes.

import re

headers = ['evar1','evar2', ... , 'visid_high','visid_low']

schema_fields = []

headers_regex_pattern = '[^a-zA-Z0-9_]'
regex = re.compile(headers_regex_pattern)

for header in headers: 
    '''Fields must contain only letters, numbers, and underscores.
    Start with a letter or underscore, 
    and be at most 128 characters'''
    
    column_name = regex.sub('',header)
    column_type = 'STRING'

    # This is required for the partitioning

    if (header == 'date_time'):
        column_type = 'TIMESTAMP'
        
    column = bigquery.SchemaField(column_name, column_type)
    schema_fields.append(column)

job_config.schema = schema_fields

The actual load operation is straightforward.

from google.cloud import bigquery


uri = 'gs://bucket name/path/to/blobs/*.tsv.gz'
dataset = 'dataset name'
destination_table = 'click_stream'

client = bigquery.Client(project = 'project-id')
dataset_ref = client.dataset(dataset)


load_job = client.load_table_from_uri(uri,
            dataset_ref.table(destination_table),
            job_config = job_config)  # job_config is defined earlier

print('Starting job {}'.format(load_job.job_id))
load_job.result()  # Waits for table load to complete.


print('Job finished.')

If for any reason the job fails, the initial error message might not be the most explanatory. In those cases it is always helpful to investigate the load job error stream in Google Cloud Shell:

bq --format=prettyjson show -j job-id

On my last run, it took around 20 minutes to load around 150 GB of compressed TSV files. The volumes will vary depending on the traffic of your site, the number of columns, the information you collect etc.

Once the table is available, you can see in the Schema tab all the columns created. Personally, what fascinates me most is the amount of processing Adobe performs in the background. It is not easy to grasp until you start exploring the feeds and realize how much of the burden is hidden from end users.

150 GB of compressed tsv files with 500 columns explodes into more than 2 TB and 650 million rows of data. That’s just for a couple of months.

Next time you ask for the last two years of product views broken down by device, marketing channel and product classification, for visitors that placed an order with an AOV higher than £20 on the previous Black Friday, take a moment to appreciate the work that takes place in the background!

Querying time

Basic queries are easy. However, you also want to be cost efficient, and this is where time partitioning becomes useful (see BigQuery Pricing Best Practices). BigQuery charges based on the amount of data you process per query. There is a free tier for the 1st TB processed, but that is easy to exceed.

One way to limit costs is to select only the relevant columns. However, if you have billions of rows, you are still going to process a lot of data. For example, if your table contains the last 2 years' worth of data but you want to analyse just two months, all the rows will still be processed (and billed) even if you use a WHERE or LIMIT clause. This additional billing is avoided with time partitioning during the table creation step. And by enforcing the time partition filter to be required, you are essentially telling your users to actively think about the date range and its cost implications.

-- Number of page views by day (excluding hits flagged for exclusion)

SELECT
  EXTRACT(DATE FROM date_time) AS day,
  COUNT(date_time) AS hits_volume
FROM
  `[dataset name].click_stream`
WHERE
  date_time > '2018-08-01'
  AND date_time < '2018-08-10'
  AND page_event = '0'
  AND exclude_hit = '0'
GROUP BY
  day
ORDER BY
  day ASC

This creates a big saving in the long term, which is probably hard to appreciate initially. It comes at the expense of always requiring a WHERE clause on the partition field for queries against this table, which is a negligible inconvenience.
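
If you want to sanity-check the cost of a query before running it, the BigQuery Python client supports dry runs. A minimal sketch, reusing the query above with placeholder project and dataset names:

from google.cloud import bigquery

client = bigquery.Client(project='project-id')

# A dry run validates the query and reports the bytes it would scan,
# without executing it or incurring query charges
dry_run_config = bigquery.QueryJobConfig()
dry_run_config.dry_run = True
dry_run_config.use_query_cache = False

query = '''
SELECT EXTRACT(DATE FROM date_time) AS day, COUNT(date_time) AS hits_volume
FROM `[dataset name].click_stream`
WHERE date_time > '2018-08-01' AND date_time < '2018-08-10'
GROUP BY day
'''

job = client.query(query, job_config=dry_run_config)
print('This query would process {:.2f} GB'.format(
    job.total_bytes_processed / 1024 ** 3))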

What about the products variable? Admittedly, the initial data is not in the most practical format due to merchandising eVars, success events, as well as the presence of multiple products in the same hit.

But this is where SQL becomes interesting! The following calculates the top 3 most viewed products per day.

-- Create temporary table with products viewed
WITH product_info AS (
  SELECT
    EXTRACT(DATE FROM date_time) AS day,
    SPLIT(product_list,',') AS products
  FROM
    `[dataset name].click_stream`
  WHERE
    date_time >= '2018-08-02'
    AND date_time < '2018-08-05'
    AND page_event = '0'
    AND exclude_hit = '0'
    AND product_list IS NOT NULL)


SELECT
  day,
  t.product_id,
  t.daily_rank,
  t.instances
FROM (

  -- Extracts the product id from the products string and calculates the daily rank
  SELECT
    day,
    SPLIT(prd,';')[OFFSET(1)] AS product_id,
    COUNT(prd) AS instances,
    RANK() OVER(PARTITION BY day ORDER BY COUNT(prd) DESC) AS daily_rank
  FROM
    product_info,
    UNNEST(products) AS prd
  WHERE
    prd != ''
  GROUP BY
    day,
    product_id
  ORDER BY
    day ASC,
    instances DESC 
   -- End of nested statement

) as t
WHERE t.daily_rank <= 3

And you get this nice and clean view.

There is a lot above to go through, but in order to process and analyse the products variable you need to experiment with arrays in BigQuery, especially the UNNEST statement and its related functionality. In a similar fashion you can extract and format merchandising eVars and metrics. There are powerful functions that will help you greatly in your in-depth analysis!

Additional points

There are a few aspects not really explored in the above walkthrough. I will just summarise them:

  • It is assumed that you are authenticated using your personal account. Google Cloud SDKs offer the option to authenticate with a service account as well (highly recommended), and the Python packages for the different GCP services support this authentication method too; see the sketch after this list.
  • The lookup tables contain information about browser versions, location etc. and can be loaded in a similar manner. The difference is that their schema is just two STRING columns with headers “key” and “value”.
  • A local machine can be used instead of a VM in Compute Engine. You save the cost of the VM, but you will have to maintain and operate the machine yourself.
  • Alternative solutions such as Google Cloud Composer can be used for scheduling, but for the purposes of this walkthrough a simple VM is more than enough.
  • Alternative approaches can be followed for data processing. Storing compressed data in Cloud Storage is much cheaper than storing decompressed data in BigQuery. For this reason, you can always write Dataflow or PySpark jobs to parse your files and load only the necessary columns into BigQuery.
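
Regarding the service account point above, a minimal sketch of how the Python clients can authenticate with a key file (the path is hypothetical):

from google.cloud import bigquery, storage

# Hypothetical path to a service account key file created in the GCP console;
# both client libraries accept it in the same way
key_path = '/path/to/service-account.json'

storage_client = storage.Client.from_service_account_json(key_path)
bigquery_client = bigquery.Client.from_service_account_json(key_path)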

I am offering an out-of-the-box solution that implements the process described. Check https://analyticsmayhem.com/consulting/adobe-analytics-clickstream-bigquery-connector-beta/ for more details!

PS: If you are interested in cross-device reporting, check my previous article.

Cross-Device Identification

Recently I had the pleasure and the opportunity to present at the latest MeasureCamp London the approach we followed to implement Cross-Device Identification in Adobe Analytics using Tealium (TiQ & AudienceStream). Some high-level outcomes are presented along with ideas for next steps.

Key points:

  1. Why we need cross-device identification and reporting capabilities
  2. Solution design; we worked on top of an already suggested approach and improved it to fit our requirements. Particularly:
    • The marketing channel mix (including organic and direct traffic) must always be properly attributed, under all circumstances
    • Handle more than one customer login in the same browser
    • Be able to enrich information with offline data
  3. Reporting opportunities
  4. Next steps (ideally!)

Have a look at it!

Please let me know about comments, ideas and potential alternative approaches that you have implemented to solve this problem!

PS: Even though the presentation focuses on Adobe, there was a significant amount of work undertaken in Tealium which is not mentioned in the slides.