Advanced cases of Adobe Clickstream with BigQuery

In the previous post, I provided a few basic examples of how to start analysing Adobe Analytics Clickstream Data Feeds in BigQuery. This time we are going to complete the remaining points of the agenda. Overall we have:

  1. Hit-based segments (part 1)
  2. Visit-based segments (part 1)
  3. Visitor-based segments (part 1)
  4. Sequential segments (part 2)
  5. Success Events (part 2)
  6. Products and merchandising eVars (part 2)

You can review points 1-3 in the previous part in case you missed it. Let’s continue with the rest of it!

4. Sequential segments

In order to implement a sequential segment, we take the raw data and self-join the table on the commonly used IDs mentioned before (i.e. post_visid_high, post_visid_low, visit_num). However, we also need to apply time-based constraints, e.g. on step 1 the customer visited a product URL and on step 2 a checkout URL. This can easily be done with a time comparison on the date_time column. Eventually we can extract into a separate table the post_visid_high, post_visid_low and visit_num values that satisfy the constraints.

SELECT
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container
FROM
   `[dataset id].[table name]` AS step1, -- self-join
   `[dataset id].[table name]` AS step2
WHERE
   step1.date_time > '2018-10-01'
   AND step1.date_time < '2018-10-05'
   AND step2.date_time > '2018-10-01'
   AND step2.date_time < '2018-10-05'
   AND step1.date_time < step2.date_time -- Time-based Comparison
   AND hit_segment(step1.page_url,'contains','/product/')
   AND hit_segment(step2.page_url,'contains','/checkout/')
   AND step1.post_visid_high = step2.post_visid_high
   AND step1.post_visid_low = step2.post_visid_low
   AND step1.visit_num = step2.visit_num -- Visit container
GROUP BY
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container

A couple of notes:

  1. The date_time conditions are needed because we are querying a time-partitioned table. If you opt out of such a configuration (not recommended), then you can ignore them.
  2. A key comparison is "step1.date_time < step2.date_time". This enforces the logic that the customer first took action A and then action B.

The result is a set of visitor IDs/visit numbers that meet the segment criteria. We can save it as an intermediate table and then join it with the raw data on all three columns. If we remove the column "visit_num", the result becomes a sequential visitor segment.
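
As a hedged illustration, assuming the query above was saved as a hypothetical intermediate table named `[dataset id].sequential_segment`, the join back to the raw data looks like this:

SELECT
   hits.*
FROM
   `[dataset id].[table name]` AS hits
JOIN
   `[dataset id].sequential_segment` AS segment -- hypothetical intermediate table
ON
   hits.post_visid_high = segment.post_visid_high
   AND hits.post_visid_low = segment.post_visid_low
   AND hits.visit_num = segment.visit_num -- drop this condition for a visitor segment
WHERE
   hits.date_time > '2018-10-01'
   AND hits.date_time < '2018-10-05'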

5. Success Events

Manipulating success events requires a bit more work. First you need to re-format the data, and you also need to know what each number represents. Let's have a first look at success events:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The columns post_visid_high/low and visit_num are familiar, but event_list is just a comma-delimited list of numeric event IDs.

This is not practical for two reasons: you cannot really count the number of times each event occurs or perform any analytical operation, and you cannot tell what the numbers mean (185 anyone?) unless you know them by heart.

Extracts from Adobe Data Feeds come with a convenient mapping file (event.tsv) that maps the numbers to something more meaningful, especially for out-of-the-box events. The neat thing is that event_list also contains instances of eVarXX. This is handy if you want to analyse hits where an instance of an eVar is set, or simply count/compare instances (for this you need the post version of the column, post_event_list).

Having said the above, two BigQuery features are key to properly processing the events:

  • SPLIT function: Since event_list is a comma-delimited string value, we can split it as a first step so that each individual event lands in a separate cell. The syntax of the function is straightforward:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   split(event_list,',') as splitted_events_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The outcome is that the initial row is converted into an array.

However this is still not enough. The "splitted_events_list" column technically contains an array in each cell, so the position of a given event varies from row to row (event 188 may sit in the second array element in one row and in the third in another). The more complicated the data gets, the harder it is to analyse. That leads us to the next point!

  • UNNEST function: After we have generated the array, we need to flatten it. At this stage BigQuery cross-joins each row with each element of the array in the splitted_events_list column that corresponds to that row:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   events_formated
FROM
   `[dataset id].[table name]`, 
   unnest(split(event_list,',')) as events_formated
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

So from the previous table, we end up with one entry per post_visid_high/low and value in the event_list (i.e. we now have a full row for event 188 in every case where the event appears).

To make things easier we can join with the "event" lookup file (key and value were used as column headers during import):

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   events_formated,
   mapping.value 
FROM
   `[dataset id].[table name]` as hits, 
   unnest(split(event_list,',')) as events_formated
JOIN 
   `[dataset id].event` as mapping
ON 
   events_formated = mapping.key 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

Now it is much easier to filter and work on individual events. Following the same logic we used in segments, you can generate visitor or visit IDs that have a specific event or volume of events, as in the sketch below.
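
For example, here is a minimal sketch that returns the visits which triggered a given event at least twice; the event name and the threshold are hypothetical:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   count(*) as event_count -- occurrences of the event within the visit
FROM
   `[dataset id].[table name]`,
   unnest(split(event_list,',')) as events_formated
JOIN
   `[dataset id].event` as mapping
ON
   events_formated = mapping.key
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null
   AND mapping.value = 'Custom Event 7' -- hypothetical event name
GROUP BY
   post_visid_high,
   post_visid_low,
   visit_num
HAVING
   count(*) >= 2 -- hypothetical threshold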

Bonus point: As you can observe above, not everything is as clean as it should be. Who knows what Custom Event 7 means?! Apart from the Analytics team, nobody! You can create an external table maintained in Google Drive that holds descriptive values and join it in, thus providing a more user-friendly view.
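
A minimal sketch of how such a lookup could be defined, assuming a Google Sheet with two columns (event_id, description); the table name, column names and spreadsheet ID are placeholders:

CREATE EXTERNAL TABLE `[dataset id].event_descriptions`
(
   event_id STRING,    -- matches mapping.key in the queries above
   description STRING  -- friendly description maintained by the Analytics team
)
OPTIONS (
   format = 'GOOGLE_SHEETS',
   uris = ['https://docs.google.com/spreadsheets/d/[spreadsheet id]'],
   skip_leading_rows = 1
)

Anyone with edit rights on the sheet can then keep the descriptions up to date without touching BigQuery.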

6. Products and merchandising eVars

Admittedly, the products variable is the most complicated one. In order to provide a powerful eCommerce reporting capability, Adobe had to come up with this complicated variable:

s.products = 
"Category 1;Product 1;Quantity 1;Price 1;events 1;Merch eVars 1
[,Category 2;Product 2;Quantity 2;Price 2;events 2;Merch eVars 2]
[,...]";

Not very pretty. We can start with the same principle, splitting the variable into individual product entries. Multiple product entries are separated by a comma:

SELECT
   product_list,
   split(product_list,',') as splitted_product_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

As a result, each product ID and the associated events, eVars etc. populate different cells within the array result. This gives us a better view of what data each product entry contains.

This is not good enough, though. We need to dig deeper into the structure. For that reason we will flatten the results using UNNEST. Then we will split "splitted_product_list" again, this time using the semicolon as a separator. We can convert the result into different columns using the OFFSET operator, which returns a specific (zero-based) entry within an array. Because within an individual product entry the product ID, events and eVars always sit in the same spot (2nd, 5th and 6th respectively, hence offsets 1, 4 and 5 below), it is very easy to extract the values:

SELECT
   post_visid_high,
   split(splitted_product_list,';')[offset(1)] as product_id,
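   -- note: safe_offset would return NULL instead of raising an error if a product entry omits trailing fields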
   split(splitted_product_list,';')[offset(4)] as events,
   split(splitted_product_list,';')[offset(5)] as evars
FROM
   `[dataset id].[table name]` , 
   unnest(split(product_list,',')) as splitted_product_list
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

So the less-than-practical table from before can be transformed into something far easier to work with.

This however only allows us to work with the product_id column. Events and eVars still need further splitting/flattening if we want to do more than a high-level analysis. But following the same methodology (split and flatten) you can get to the desired results very easily, as in the sketch below.
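
As a hedged sketch of that next step, assuming the standard s.products syntax where multiple events within a product entry are pipe-delimited, one more split/unnest produces one row per product event (safe_offset tolerates entries that omit the events field):

SELECT
   post_visid_high,
   split(splitted_product_list,';')[safe_offset(1)] as product_id,
   product_event -- e.g. 'event1=4.99' for a numeric/currency event
FROM
   `[dataset id].[table name]`,
   unnest(split(product_list,',')) as splitted_product_list,
   unnest(split(split(splitted_product_list,';')[safe_offset(4)],'|')) as product_event
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null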

Summary

This is by no means a comprehensive list of the queries you can perform on Adobe Analytics Data Feeds, but I hope it provides you with a good starting point!

A few closing notes:

  1. Make sure you understand the differences between post and pre columns. The differences in their values depend on your implementation and report suite configuration.
  2. Make sure to exclude from your calculations hits that are discarded by Adobe; data feeds contain them as well (see the sketch after this list).
  3. Understand how the visitor identification process works.
  4. The SQL used above is written for convenience rather than performance optimisation, but if you notice room for improvement, please let me know.
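
Regarding point 2, here is a hedged sketch of the usual filter. The exclude_hit and hit_source columns come from Adobe's Data Feed documentation; depending on how you typed the columns on import, the string literals may need to become integers:

SELECT
   *
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND exclude_hit = '0' -- any value greater than 0 means Adobe excluded the hit from reporting
   AND hit_source not in ('5','7','8','9') -- summary rows uploaded via data sources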

Start playing with Data Feeds, and let me know if you have any comments!
