Adobe Analytics Clickstream & BigQuery introduction

In the previous post, I tried to elaborate on the main steps involved in loading Adobe Analytics clickstream data from Data Feeds into BigQuery. If you missed it and you are not sure how to start experimenting yourself, have a look at it!

This time, in this 2-part post, I will focus on some practical examples of how to approach common cases and build, the SQL way, the logic that we usually implement through segments.

Particularly:

  1. Hit-based segments
  2. Visit-based segments
  3. Visitor-based segments
  4. Sequential segments
  5. Success Events
  6. Products and merchandising eVars

Before you actually start experimenting with feeds, the first thought that will come to your mind is “I can do this in 30 seconds within Adobe Workspace, why should I bother with a more complicated method?” The answer is not just about getting numbers out. If you want to succeed long term as an analyst, you first need to be able to work outside of your comfort zone (i.e. Adobe Analytics). There will always be times when you need to extract data straight from a DBMS. Your business might also require you to combine web analytics data with other first-party data within a data warehouse system; the following examples can be applied outside of BigQuery. Additionally, the more you understand how Adobe Analytics (or any other web analytics solution) works, the more confidence you will have in your analysis.

Finally, and this is more of a personal thought, I do believe that in the future, as web analytics vendors and companies become more mature, web analytics will converge with and be absorbed by the overarching Analytics & Data functions. So if you want to step up the ladder, you need to embrace all aspects of the data & analytics tech stack available to you.

Let’s get started!

1. Hit segments

This is the simplest scenario. Since we are already working with raw clickstream hits, all we have to do is filter the rows. For convenience, we create a UDF (user-defined function) that applies the segment logic.

CREATE TEMP FUNCTION
  hit_segment(variable STRING,
    match_type STRING,
    value STRING) AS (
    CASE
      WHEN match_type = 'contains' 
           THEN variable LIKE CONCAT('%',value,'%')
      WHEN match_type = 'equals' 
           THEN variable = value END);

The above implements the logic applied in simple hit segments, i.e. parameter X contains or equals value Y. Based on the input parameters (the variable, the match type ('equals' or 'contains') and the expected value), it returns TRUE or FALSE, which can then be used in the WHERE clause of your query. You can expand it to use more complicated logic; a sketch of such an extension follows the next example. The UDF is invoked as shown below:

SELECT
   pagename
FROM
   `[dataset name].[table name]` 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND hit_segment(page_url,'contains','/product')

Straightforward! One thing to keep in mind is that temporary UDFs need to be defined in the same query job as your base query; otherwise the execution will fail.
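If you need richer match types, the function can be extended along the same lines. Here is a minimal sketch; the extra 'starts with' and 'does not contain' branches are my own illustrative additions, not something Adobe prescribes:

CREATE TEMP FUNCTION
  hit_segment(variable STRING,
    match_type STRING,
    value STRING) AS (
    CASE
      -- original match types
      WHEN match_type = 'contains'
           THEN variable LIKE CONCAT('%',value,'%')
      WHEN match_type = 'equals'
           THEN variable = value
      -- illustrative additions
      WHEN match_type = 'starts with'
           THEN STARTS_WITH(variable, value)
      WHEN match_type = 'does not contain'
           THEN variable NOT LIKE CONCAT('%',value,'%')
    END);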

2. Visitor segments

I have decided to tackle visitor segments first, introduce the logic, and then expand it to visit-based segments. This is a two-step process:

  1. Identify the visitors that satisfy the segment criteria and extract their IDs. Adobe recommends using the post_visid_high and post_visid_low columns to uniquely identify visitors. We will store the results in an intermediate table (the segment table).
  2. Join that table with your original data set and return only the hits of those particular visitors.

For example, let's extract all the visitors that viewed a product detail page. We are re-using the temporary function "hit_segment" from earlier. We also need to specify a date range since we are working with time-partitioned data. This provides a degree of flexibility: we can select visitors that performed an action in, say, January and then view their activity over a different time period.

SELECT
   post_visid_high,
   post_visid_low
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-10-01' 
   AND date_time < '2018-10-30' 
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low

The result is a list of values where each row corresponds to an individual visitor that viewed a product page at some point in October 2018.

We save the result above in a secondary table named "visitor_segment" and then join the original data set with the "visitor_segment" table:

SELECT
   data.va_closer_id,
   COUNT(DISTINCT CONCAT(CAST(seg.post_visid_high AS STRING), '-',
                         CAST(seg.post_visid_low AS STRING))) AS unique_visitors
FROM
   `[dataset id].[table name]` AS data
JOIN 
   visitor_segment AS seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
WHERE
   data.date_time > '2018-12-01' 
   AND data.date_time < '2019-01-31'
GROUP BY
   data.va_closer_id

The two-step process we just followed calculates the number of unique visitors per marketing channel, over the period December 2018 to January 2019, that had viewed a product page in October 2018.

In general, I find it preferable to save segment results in tables rather than use saved queries, temporary tables or nested queries, for the following reasons:

  1. Higher reusability/reproducibility of analyses. Once the outcome of a segment is considered valid, then it is easier to share and document the table rather than the underlying query. 
  2. Referencing the segment table in other queries makes the code much cleaner than introducing multiple nested statements.
  3. Temporary tables and views do not get materialised and you are getting billed every time you query them.

For the reasons mentioned above, the same logic (saving segments as tables) will be assumed for the rest of the post.
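For completeness, here is a minimal sketch of how the "visitor_segment" table from earlier could be saved with BigQuery DDL; the dataset name is a placeholder, and the segment condition is inlined rather than routed through the temporary UDF:

CREATE OR REPLACE TABLE
  `[dataset id].visitor_segment` AS
SELECT
  post_visid_high,
  post_visid_low
FROM
  `[dataset id].[table name]`
WHERE
  date_time > '2018-10-01'
  AND date_time < '2018-10-30'
  -- inlined condition; hit_segment could be used instead
  -- if it is defined within the same query job
  AND page_url LIKE '%/product%'
GROUP BY
  post_visid_high,
  post_visid_low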

3. Visit segments

At the moment there is no single column that holds a unique visit ID. However, we can overcome this by utilising the "visit_num" column. It corresponds to the Visit Number dimension within Adobe reports: it starts at 1 and increments each time a visitor starts a new visit. So, in combination with post_visid_high/post_visid_low, it uniquely identifies each visit of a visitor.
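To make the idea concrete, here is a small sketch that derives a surrogate visit ID and uses it to count visits; the CONCAT-based ID is just one possible convention:

SELECT
   COUNT(DISTINCT CONCAT(CAST(post_visid_high AS STRING), '-',
                         CAST(post_visid_low AS STRING), '-',
                         CAST(visit_num AS STRING))) AS visits
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-12-01'
   AND date_time < '2018-12-31'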

We follow the same approach to create a table named "visit_segment" by introducing the "visit_num" column. It will contain all visits that have at least one hit on a product page during December 2018:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-12-01' 
   AND date_time < '2018-12-31'  
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low,
   visit_num

Now we can return all the visits that satisfy the segment requirements. Again we take the same query as before, but this time we add visit_num as an additional join condition.

SELECT
   data.va_closer_id AS marketing_channel, 
   COUNT(DISTINCT CONCAT(CAST(seg.post_visid_high AS STRING), '-',
                         CAST(seg.post_visid_low AS STRING))) AS unique_visitors
FROM
   `[dataset id].[table name]` AS data
JOIN 
   visit_segment AS seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
   AND data.visit_num = seg.visit_num
WHERE
   data.date_time > '2018-12-01' 
   AND data.date_time < '2018-12-31'
GROUP BY
   data.va_closer_id

Note: the date ranges now need to match between the two queries, otherwise you will get no results! This is because we are working with a visit segment (and using visit_num as a join condition).

At the end you will get a view similar to this one. The marketing channels are represented as numbers, and each number depends on the report suite's configuration. The lookup for these IDs can be found in the Marketing Channel Manager section.
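If you maintain that lookup yourself as a table in BigQuery (the channel_lookup table below is hypothetical), you can join it in to get readable channel names, for example:

SELECT
   lkp.channel_name,
   COUNT(DISTINCT CONCAT(CAST(data.post_visid_high AS STRING), '-',
                         CAST(data.post_visid_low AS STRING))) AS unique_visitors
FROM
   `[dataset id].[table name]` AS data
JOIN
   -- hypothetical lookup table with columns va_closer_id, channel_name
   `[dataset id].channel_lookup` AS lkp
ON
   data.va_closer_id = lkp.va_closer_id
WHERE
   data.date_time > '2018-12-01'
   AND data.date_time < '2018-12-31'
GROUP BY
   lkp.channel_name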

Summary

Going through the data and getting familiar with it is much easier than it looks. One of the best exercises I would suggest is to try to replicate basic reports in BigQuery and match them with the same reports within Adobe Analytics (e.g. daily page views or daily exit links). This way you will also familiarise yourself with other data peculiarities, for example when hits should be excluded from any calculations (hint).
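As a starting point, here is a sketch of a daily page views query. It assumes the standard data feed columns post_page_event (0 for a page view) and exclude_hit (non-zero for hits that should be dropped), loaded as strings, and date_time loaded as a TIMESTAMP; adjust it to your own schema.

SELECT
   DATE(date_time) AS day,
   COUNT(*) AS page_views
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-12-01'
   AND date_time < '2018-12-31'
   -- 0 = page view; compared as strings here, adjust if loaded as integers
   AND post_page_event = '0'
   -- drop hits that Adobe itself excludes from reporting
   AND exclude_hit = '0'
GROUP BY
   day
ORDER BY
   day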

I hope this gives you an idea of how easy it is to start analysing raw clickstream data.
