Adobe Analytics Clickstream data with dbt & SQL – Guide

I have previously elaborated on both the usefulness of working with raw clickstream data and on how powerful I consider dbt (data build tool) to be for managing SQL models. This time I will take a more practical approach and go through the basic setup to build some initial SQL models in BigQuery based on the Adobe Analytics Clickstream data. The topics to cover are:

  • Loading Adobe Analytics data
  • Setup
  • Compiling SQL files
  • Running dbt
  • Basic Metrics

Loading Adobe Analytics data

Before we get started, you need to have Clickstream data loaded in a BigQuery dataset. If you have not already done so, you can find more information about the topic at Load Data Feeds to Google BigQuery to understand the key steps and configurations of the process.

Furthermore, if you are not a Project Admin in the GCP project, make sure you have the following permissions (they are needed to execute dbt):

  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User

Setup

First you need to install the library; you have a few options available (https://docs.getdbt.com/docs/installation). For the remainder of this guide, I have used pip as the installation method on macOS. After you have installed it, clone the repository from https://github.com/konosp/adobe-clickstream-dbt and navigate into adobe-clickstream-dbt. The path models/introduction contains 5 SQL files; those are dynamic files based on Jinja – a templating language.

Let’s examine the first one, clickstream_full.sql. You will notice it looks like a regular SQL file that queries a typical Data Feed file from Adobe Analytics. The key point is that anything between double curly brackets gets parsed by Jinja: config captures model-specific configurations, i.e. the type of materialisation (view/table/ephemeral); var provides dynamic variables (see below); and ref creates references to other models. References help dbt assess the model hierarchy and the appropriate order of execution.

Sample model
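To make this concrete, below is a minimal sketch of the flavour of these files (not the repository's exact contents); the columns are illustrative, and the variables dataset, click_stream_table, start_date and end_date are the ones defined in dbt_project.yml, covered further down.

{{ config(materialized='ephemeral') }}  -- model-specific configuration parsed by Jinja

SELECT
    date_time,
    post_visid_high,
    post_visid_low,
    visit_num,
    post_page_event,
    post_pagename
FROM `{{ var('dataset') }}.{{ var('click_stream_table') }}`   -- dynamic variables resolved at compile time
WHERE date_time >= '{{ var('start_date') }}'
  AND date_time <= '{{ var('end_date') }}'
-- downstream models reference this one via ref('clickstream_full'), which is how dbt builds the dependency graph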

Compiling SQL files

A similar structure exists in the rest of the models. Now execute dbt compile in the terminal. If the output is similar to the below, the library installation completed successfully and the models are discoverable.

dbt compile output

However this does not mean the queries have been executed on BigQuery. dbt has just compiled our parametrised models into executable SQL files. The compiled files are under target/compiled/core/introduction. If you inspect any of the files, you will notice that their syntax is valid SQL.

Compiled SQL

But where are those variables defined? The answer is the dbt_project.yml file. The repository already contains this file. However, when you initiate a new project through dbt init (not needed if you clone the repository), this file is created automatically with a basic sample structure. The configuration shown below instructs all the models placed under the introduction folder to:

  • Be materialised as tables (unless a different config is set within the SQL file).
  • Have the variables dataset, click_stream_table, start_date and end_date available, set to the values shown in the file.
dbt_project.yml
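For orientation, a rough sketch of the relevant parts of such a file is shown below. The exact keys and their placement depend on your dbt version (recent versions accept a top-level vars block), and the values here are placeholders rather than the repository's actual settings.

name: 'core'                      # project name; compiled output lands under target/compiled/core
version: '1.0'
profile: 'default'                # must match a profile in ~/.dbt/profiles.yml

models:
  core:
    introduction:
      materialized: table         # default materialisation for models/introduction

vars:
  dataset: 'your_bigquery_dataset'              # placeholder values
  click_stream_table: 'your_clickstream_table'
  start_date: '2019-01-01'
  end_date: '2019-12-31'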

There are multiple ways to organise the settings and a read through the documentation would give you a much better idea of your options.

Before you proceed with executing SQL on BigQuery, you will need to set up your profiles.yml file (https://docs.getdbt.com/docs/configure-your-profile#section-getting-started). It handles the connection settings with BigQuery; the repository contains a sample file (profile-sample.yml). You will need to copy the file to the correct path – on macOS that is ~/.dbt/profiles.yml.

profile-sample.yml
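The shape of such a profile, using oauth as the authentication method (see the note further down), is roughly the following; the project and dataset values are placeholders:

default:                          # profile name referenced by dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth               # relies on the Google Cloud SDK credentials
      project: your-gcp-project-id
      dataset: your_bigquery_dataset
      threads: 1
      location: US                # match the location of your dataset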

Running dbt

How do you actually run the models? Easy: dbt run! Of course, there is a long list of options to use, but for a basic demonstration the above will suffice. If everything executes smoothly (authentication, datasets in place, permissions, valid Standard SQL etc.), you should see a similar output.

dbt run output

Note: for the execution to work, profiles.yml is configured to use oauth authentication. Make sure you have the Google Cloud SDK installed.

All models have executed, the tables are built and you have a nice overview of the results. Note: the dataset configured in dbt_project.yml needs to be present in your GCP project.

BigQuery Tables

Only 4 tables were created. The reason the clickstream_full table is missing is the configuration {{ config(materialized='ephemeral') }} within the clickstream_full.sql file, which overrides any setting in dbt_project.yml. Ephemeral models are not built in the warehouse; instead, each model that references clickstream_full gets its SQL inlined as a common table expression. So no table or view is created for it.

Ephemeral Materialisation

Basic Metrics

Now that everything works, let’s go through the process of calculating some basic metrics. First we need to filter out all hits that should be excluded from reporting (i.e. bots, IP filtering, truncated hits) or that originate from non-web hits (i.e. data sources). To do so, within clickstream_clean.sql we set the following WHERE conditions (documentation), as sketched below:

  • exclude_hit = '0' AND hit_source NOT IN ('5', '7', '8', '9')
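Put together, the clean model boils down to something like the sketch below (the repository file is the reference for the exact columns and materialisation):

-- clickstream_clean.sql (sketch): keep only hits Adobe would include in reporting
SELECT *
FROM {{ ref('clickstream_full') }}
WHERE exclude_hit = '0'
  AND hit_source NOT IN ('5', '7', '8', '9')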

Next we calculate two different daily metrics, page views and visits (I defined them separately on purpose, explanation below). Each file references clickstream_clean using the ref function.

Daily Page Views model
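As a rough sketch of what such a model can look like (the repository file is the reference; the post_page_event filter is an assumption based on the standard Data Feed schema):

SELECT
    DATE(date_time) AS date,       -- assumes date_time is loaded as a DATETIME/TIMESTAMP; cast first if it is a plain string
    COUNT(*) AS page_views
FROM {{ ref('clickstream_clean') }}
WHERE post_page_event = '0'        -- assumption: '0' marks standard page view hits; link-tracking hits use other codes
GROUP BY date

The daily visits model follows the same pattern, counting distinct combinations of post_visid_high, post_visid_low and visit_num instead of rows.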

The repository contains the exact files for both metrics. However remember that their date range depends on the values (start_date & end_date) configured in dbt_project.yml.

The best way to understand the different dependencies is through the documentation generated by dbt. There it becomes obvious that clickstream_clean depends on the root model clickstream_full (the ephemeral model), and that its children are daily_page_views and daily_visits. If the clean dataset definition is modified (i.e. the SQL is updated), all its children will have to be updated (hint: this is managed magnificently by dbt!).

Model Graph – derived from the usage of the ref() function

This is just an idea of how you can structure your data workspace and start working with clickstream data in a much better fashion.

Epilogue

Utilising SQL models through dbt makes the life of an analyst working with Adobe Analytics Clickstream data much easier and more organised. There is huge potential and room for improvement in our daily analysis by taking advantage of such tools!

In the future I will be enhancing the models with more complicated and advanced cases! Let me know if you have any comments or thoughts!

PS: All code samples are at https://github.com/konosp/adobe-clickstream-dbt

Adobe Analytics and time between funnel steps

Recently I had to deal with the question “How much time does it take for a customer to go through funnel steps?” A simple question on the face of it. But the more I thought about it, the more complicated it became. So what is the best way to measure and report, using Adobe Analytics, the time required for your visitors to go through different funnel steps?

The main requirement I had to meet was something along the lines of the below:

Requirements

I am going to demonstrate a few alternative approaches to answering this question. Depending on your resources, setup and requirements, you have a few options to consider:

  1. Adobe Analytics out-of-the-box report
  2. Adobe Analytics plugin
  3. Data collection customised solution
  4. Adobe Data Feeds and SQL

1) Adobe Analytics Out of the box report

The easiest way to answer such a question would be to use the default Adobe Analytics report “Time prior to event”. It calculates the amount of time required for a success event to occur within a visit. However, the report is not really practical because the timing values are returned as dimension values (grouped in buckets). This is not very useful if you want a trended analysis, i.e. to monitor the over-time performance of customers performing a certain action. If that were possible, this would be the best combination of ease of implementation versus utility.

Time Prior to Event sample

2) Adobe Analytics plugin

There is a plugin that evaluates the difference between two success events: “getTimeToComplete”. On the surface it looks like a nice solution. However, it is hard to ignore some key limitations:

  1. It calculates the difference between two events only. You have to code different logic for each pair of events, i.e. one block for “Success Event 12” – “Success Event 10” and a different one for “Success Event 40” – “Success Event 31”.
  2. It suggests storing the value in a dimension, which is good for classification but not so good for trended analysis.

There is always the option to modify the plugin and customise it, but if you are willing to go down that path, please read below!

3) Data collection custom solution

This is where things can get complicated! The more I thought about the process, the more caveats I came up with that needed to be addressed. The initial success events and prop configured would be:

  • 1 success event to capture the time difference in seconds from the “Start” of the journey; a numeric event
  • 1 success event to capture the total number of instances a measurement was recorded
  • 1 prop to capture the current step, i.e. “step 1”, “step 2”, “end”. This will be utilised later on for hit-based segmentation to isolate the measurements for each individual step

Now let’s discuss the considerations:

  • Each measurement would need to be captured once per logical time frame (or session). This is required to avoid counting multiple times cases where users go through certain steps of the process more than once.
  • If the user comes back to the site after that logical time frame, we would like to capture the measurements again.
  • We need to define explicitly all the steps of the process in order to adjust the data collection. That has the inherent limitations of no historical data and low flexibility.

In order to deal with that, I had to use a concept I call a journey. The journey has an explicit starting point and fixed steps; we need to know those in order to calculate the various time differences. A journey:

  1. Will start once and has a certain duration.
  2. Cannot restart during that time period.
  3. Will capture each measurement only once during the same time period.
  4. Is checked for expiry every time the user goes through the steps. If it has expired, it is eligible to be restarted and all subsequent steps can be measured again.

As you can see, there are already a lot of constraints and limitations. But let's see how this looks visually. The image below highlights the basic steps that a user would have to go through. Once the “Registration Start” step is triggered, the journey is initiated and the expiration is set 30 minutes (or any time frame that works for you) in the future. Then each step is recorded only once during that journey.

Journey

In order to achieve that, the code below was required. To be honest, the measurements were surprisingly good, but as you can see from the code the implementation is far from straightforward. All the logic, considerations and constraints had to be implemented. Have a look:

But how did the results look? Is this something that can be used? Actually it can, with some limitations! Since all the measurements are captured using 2 success events and a prop, we need some segmentation to get the numbers. In particular, to calculate the time required to get from the “start” of the journey to each step, we do the following. First we calculate the average time since the start of registration.

Calculating the average time since the journey start

After that, we segment at hit level using the prop mentioned previously. This will isolate the measurements for the time required to reach each step.

Reporting configuration

The above reporting configuration can lead to a visualisation like the one below.

Trended view of “Time Since Start” for different steps

Unfortunately the above calculations are based on the mean value. That means they are very sensitive to extremely high measurements, and there are a lot of those. In summary:

  1. No historical data
  2. Measurements heavily depend on data collection
  3. No flexibility for custom funnels
  4. High sensitivity to high values

The initial approach could provide a good basis for monitoring the time differences between certain actions of interest. However, all the work above seems like a bit too much. Let's explore an alternative!

4) Adobe Analytics Data Feeds and SQL

Having mentioned all the methods above, they feel limiting compared to my preferred option: working with Adobe Data Feeds using SQL! I have been arguing in favour of working with Data Feeds for some time now. They are very flexible and will open up a whole new world to explore. To find out more details about loading the data and working in BigQuery, please visit my previous articles on the topic.

Let’s start with the ideal result first – what we really need! The table below provides indicative measurements where we compare the mean and median time required to reach each registration step from the first step. We can also see that the median values are much lower, highlighting a previously mentioned limitation: average values are highly sensitive to extreme measurements. PS: the numbers are not real, I fiddled with them; nevertheless the principle still applies.

Funnel Step           | Mean | Median
registration: step 2  |  852 |    494
registration: step 3  | 1346 |    935
registration: step 4  | 1549 |   1080
end                   | 2101 |   1380

But how do we get there? We will break down the key steps. The GitHub Gist below is organised in 4 separate SQL queries, each one meant to be executed on the back of the results of the previous one (a collapsed sketch of the overall logic follows the list below).

Let’s analyse them one by one.

  • 1_filterRows.sql: This is based on your implementation, but the key idea is that we want to isolate the rows where each individual step is recorded. You might need to filter based on a different prop, eVar or event. However, you always need the visitor ID, visit number, step name and timestamp.
  • 2_calculateStepStart.sql: Group by visitor ID and visit number in order to obtain the earliest time that each step of the journey took place.
  • 3_calculateTimeDifferences.sql: Calculate the time difference for each individual step since the earliest measurement per visit & visitor ID.
  • 4_calculateMeasurements.sql: Calculate the median and mean values per funnel step.
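Since the Gist itself is easiest to read on GitHub, here is an illustrative sketch of the overall shape, collapsed into a single statement with CTEs. It assumes the step name is captured in prop1 and that date_time can be cast to a timestamp; treat it as a starting point rather than the exact queries.

WITH step_hits AS (           -- 1_filterRows: isolate the rows where each step is recorded
  SELECT
    post_visid_high,
    post_visid_low,
    visit_num,
    prop1 AS step_name,                        -- assumption: step name stored in prop1
    CAST(date_time AS TIMESTAMP) AS hit_time
  FROM `[dataset id].[table name]`
  WHERE date_time > '2018-10-01'
    AND date_time < '2018-10-31'
    AND prop1 IS NOT NULL
),
step_start AS (               -- 2_calculateStepStart: earliest occurrence of each step per visitor & visit
  SELECT
    post_visid_high, post_visid_low, visit_num, step_name,
    MIN(hit_time) AS step_start_time
  FROM step_hits
  GROUP BY post_visid_high, post_visid_low, visit_num, step_name
),
time_diffs AS (               -- 3_calculateTimeDifferences: seconds since the earliest step of the visit
  SELECT
    step_name,
    TIMESTAMP_DIFF(
      step_start_time,
      MIN(step_start_time) OVER (PARTITION BY post_visid_high, post_visid_low, visit_num),
      SECOND) AS seconds_since_start
  FROM step_start
)
SELECT                        -- 4_calculateMeasurements: mean and median per funnel step
  step_name,
  AVG(seconds_since_start) AS mean_seconds,
  APPROX_QUANTILES(seconds_since_start, 2)[OFFSET(1)] AS median_seconds
FROM time_diffs
GROUP BY step_name

In practice each step can equally be materialised as its own table, which is exactly how the Gist is organised.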

Hope you have found the overview helpful! Given you have a process in place to load Adobe Analytics Data Feeds into a database, you are able to generate the same results much more easily.

As I have reiterated many times, working with the raw clickstream data is the way forward for analysts! Have fun!

Advanced cases of Adobe Clickstream with BigQuery

In the previous post I provided a few basic examples of how to start your journey analysing Adobe Analytics Clickstream Data Feeds in BigQuery. This time I am going to complete the remaining points of the agenda. Overall we have:

  1. Hit-based segments (part 1)
  2. Visit-based segments (part 1)
  3. Visitor-based segments (part 1)
  4. Sequential segments (part 2)
  5. Success Events (part 2)
  6. Products and merchandising eVars (part 2)

You can review points 1-3 in the previous part in case you missed it. Let’s continue with the rest of it!

4. Sequential segments

In order to implement a sequential segment, we need to take the raw data and self-join the table on the commonly used IDs mentioned before (i.e. post_visid_high, post_visid_low, visit_num). However, we need to apply the time-based constraints, i.e. on step 1 the customer visited a product URL and on step 2 a checkout URL. This can be done easily using a time comparison on the date_time column. Eventually we can extract into a separate table the post_visid_high, post_visid_low and visit_num values that satisfy the constraints.

SELECT
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container
FROM
   `[dataset id].[table name]` AS step1, -- self-join
   `[dataset id].[table name]` AS step2
WHERE
   step1.date_time > '2018-10-01'
   AND step1.date_time < '2018-10-05'
   AND step2.date_time > '2018-10-01'
   AND step2.date_time < '2018-10-05'
   AND step1.date_time < step2.date_time -- Time-based Comparison
   AND hit_segment(step1.page_url,'contains','/product/')
   AND hit_segment(step2.page_url,'contains','/checkout/')
   AND step1.post_visid_high = step2.post_visid_high
   AND step1.post_visid_low = step2.post_visid_low
   AND step1.visit_num = step2.visit_num -- Visit container
GROUP BY
   step1.post_visid_high,
   step1.post_visid_low,
   step1.visit_num -- Visit container

A couple of notes:

  1. The date time conditions are needed because we are querying a time-partitioned table. If you opt out of such a configuration (not recommended), then you can ignore them.
  2. A key comparison is "step1.date_time < step2.date_time". This enforces the logic that the customer first took action A and then action B.

The result is a set of visitor IDs/visit numbers that meet the segment criteria. We can save it in an intermediate table and then join it with the raw data on all three columns. If we remove the visit_num column, the result becomes a sequential visitor segment.
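As a sketch of that last step, assuming the result above has been saved as a table named sequential_visit_segment (the name is illustrative), the join back to the raw data follows the same pattern as the visit-based segments from part 1:

SELECT
   data.va_closer_id,            -- marketing channel ID
   count(*) as hits
FROM
   `[dataset id].[table name]` as data
JOIN
   sequential_visit_segment as seg
ON
   data.post_visid_high = seg.post_visid_high
   AND data.post_visid_low = seg.post_visid_low
   AND data.visit_num = seg.visit_num
WHERE
   data.date_time > '2018-10-01'
   AND data.date_time < '2018-10-05'
GROUP BY
   data.va_closer_id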

5. Success Events

Manipulating success events requires a bit more work. First you need to re-format the data, and you also need to know what each number represents. Let's have a first look at success events:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The columns post_visid_high/low and visit_num are familiar, but event_list is a comma-delimited list of numbers:

This is not practical for two reasons: you cannot really count the number of times each event occurs or perform any analytical operation, and you cannot actually tell what the above numbers mean (185 anyone?) unless you know them by heart.

Extracts from Adobe Data Feeds come with a convenient mapping file (event_list.tsv) that maps the numbers to something more meaningful, especially for out-of-the-box events. The neat thing is that the event_list column also contains instances of eVarXX. This is handy if you want to analyse hits where an instance of an eVar is set, or just count/compare instances (for this you need to use the post version of the column).

Having said the above, there are two key BigQuery features needed to properly process the events:

  • SPLIT function: Since event_list is a comma-delimited string value, as a first step we can split the values so that each individual event sits in a separate cell. The syntax of the function is straightforward:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   split(event_list,',') as splitted_events_list
FROM `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

The outcome is that the initial row is converted into an array:

However, this is still not enough. The "splitted_events_list" column above technically contains an array in each cell. So in rows 1 & 2 event 188 is in cell 2, while in rows 3 & 4 it is in cell 3. The more complicated the data gets, the harder it is to analyse. That leads us to the next point!

  • UNNEST function: After we have generated the array, we now need to flatten it. At this stage BigQuery cross-joins each row with each element of the array in the splitted_events_list column that corresponds to that row:
SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   event_list,
   events_formated
FROM
   `[dataset id].[table name]`, 
   unnest(split(event_list,',')) as events_formated
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

So from the previous table, we end up having one entry per post_visid_high/low and value in the event_list (i.e. now we have a full row for event 188 for all cases where the event appears):

To make things easier we can join with the "event" lookup file (the key and value columns were used as column headers during import):

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   events_formated,
   mapping.value 
FROM
   `[dataset id].[table name]` as hits, 
   unnest(split(event_list,',')) as events_formated
JOIN 
   `[dataset id].event` as mapping
ON 
   events_formated = mapping.key 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND event_list is not null

Now it is much easier to filter and work on individual events. Following the same logic we used in segments, you can generate visitor or visit IDs that recorded a specific event or volume of events. For example:
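A sketch of a visit-level selection for visits with at least two occurrences of event 188 (the event used in the illustrations above) could look like this:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num,
   count(*) as event_instances
FROM
   `[dataset id].[table name]`,
   unnest(split(event_list,',')) as events_formated
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND events_formated = '188'
GROUP BY
   post_visid_high,
   post_visid_low,
   visit_num
HAVING count(*) >= 2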

Bonus point: as you can observe above, not everything is as clean as it should be. Who knows what Custom Event 7 means?! Apart from the Analytics team, nobody! You can create an external table maintained in Google Drive that holds descriptive values and join it in, thus providing a more user-friendly view.

6. Products and merchandising eVars

Admittedly, the products variable is the most complicated one. In order to provide a powerful eCommerce reporting capability, Adobe had to come up with this complicated variable:

s.products = 
"Category 1;Product 1;Quantity 1;Price 1;events 1;Merch eVars 1
[,Category 2;Product 2;Quantity 2;Price 2;events 2;Merch eVars 2]
[,...]";

Not very pretty. We can start with the same principle, splitting the variable into individual product entries. Multiple product entries are separated by a comma:

SELECT
   product_list,
   split(product_list,',') as splitted_product_list
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

As a result, each product ID and the associated events, evars etc. populate different cells within the array result. This gives us a better view of what data each product entry contains:

This is not good enough; we need to dig deeper into the structure. For that reason we will flatten the results using UNNEST. Then we will split the "splitted_product_list" again, but this time using the semicolon as a separator. We can convert the result into different columns utilising the OFFSET function, which returns a specific entry within an array. Because within the individual product entries the product ID, events and eVars are always in the same spot (2nd, 5th and 6th respectively), it is very easy to extract the values:

SELECT
   post_visid_high,
   split(splitted_product_list,';')[offset(1)] as product_id,
   split(splitted_product_list,';')[offset(4)] as events,
   split(splitted_product_list,';')[offset(5)] as evars
FROM
   `[dataset id].[table name]` , 
   unnest(split(product_list,',')) as splitted_product_list
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null

So the less than practical table from before can be transformed into the below:

This however allows us to work only with the product_id column. Events and eVars still need further splitting/flattening if we want to do more than a high-level analysis. But following the same methodology (split and flatten) you can get to the desired results very easily.
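As a sketch of that next level of splitting – assuming multiple events inside a single product entry are pipe-delimited (e.g. event1=4.99|event2=2), which is worth verifying against your own feed:

SELECT
   post_visid_high,
   split(splitted_product_list,';')[safe_offset(1)] as product_id,
   product_event                 -- e.g. "event1=4.99"; product entries without events drop out of the unnest
FROM
   `[dataset id].[table name]`,
   unnest(split(product_list,',')) as splitted_product_list,
   unnest(split(split(splitted_product_list,';')[safe_offset(4)],'|')) as product_event
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND product_list is not null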

Summary

This is by no means a comprehensive list of queries that you can perform on Adobe Analytics Data Feeds but I hope it can provide you a good starting point!

A few closing notes:

  1. Make sure you understand the differences between post and pre columns. The differences in their values depend on your implementation and report suite configuration.
  2. Make sure to exclude from your calculations hits that are discarded by Adobe. Data feeds contain them as well.
  3. Understand how the visitor identification process works.
  4. The SQL used above is written for convenience rather than performance optimisation, but if you notice room for improvement, please let me know.

Start playing with Data Feeds! Let me know if you have any comments!

Adobe Analytics Clickstream & BigQuery introduction

In the previous post I tried to elaborate on the main steps involved in loading Adobe Analytics clickstream data from Data Feeds into BigQuery. If you missed it and are not sure how to start experimenting yourself, have a look at it!

This time, this 2-part post will focus on some practical examples of how to approach common cases and build the logic we usually implement through segments, the SQL way.

Particularly:

  1. Hit-based segments
  2. Visit-based segments
  3. Visitor-based segments
  4. Sequential segments
  5. Success Events
  6. Products and merchandising eVars

Before you actually start experimenting with feeds, the first thought that will come to your mind is “I can do this in 30 seconds within Adobe Workspace, why should I bother with a more complicated method?” The answer is not just about getting numbers out. If you want to succeed long term as an analyst, first you need to be able to work outside of your comfort zone (i.e. Adobe Analytics). There will always be a time when you need to extract data straight from a DBMS. Also, your business might require you to combine Web Analytics data with other 1st-party data within a Data Warehouse system; the following examples can be applied outside of BigQuery. Additionally, the more you understand how Adobe Analytics works (or any other web analytics solution), the more confidence you will have in your analysis.

Finally, and this is more of a personal thought, I do believe that in the future, as web analytics vendors and companies become more mature, web analytics will converge with and be absorbed by the overarching Analytics & Data functions. So if you want to step up the ladder, you need to embrace all aspects of the data & analytics tech stack available to you.

Let’s get started!

1. Hit segments

This is the simplest scenario. Since we are already working with raw click-stream hits, all we have to do is filter the rows. For convenience we create a UDF (user-defined function) that applies the segment logic.

CREATE TEMP FUNCTION
  hit_segment(variable STRING,
    match_type STRING,
    value STRING) AS (
    CASE
      WHEN match_type = 'contains' 
           THEN variable LIKE CONCAT('%',value,'%')
      WHEN match_type = 'equals' 
           THEN variable = value END);

The above implements the logic applied in simple hit segments, i.e. parameter X contains or equals value Y. Based on the input parameters (variable name, match type and expected value), it returns True or False, which can then be used in the WHERE clause of your query. You can expand it to use more complicated logic. The UDF is invoked as per below:

SELECT
   pagename
FROM
   `[dataset name].[table name]` 
WHERE
   date_time > '2018-10-03'
   AND date_time < '2018-10-04'
   AND hit_segment(page_url,'contains','/product')

Straightforward! One thing to keep in mind is that temporary UDFs need to be defined in the same query as your base query. Otherwise the execution will fail.

2. Visitor segments

I have decided to tackle visitor segments first, to introduce the logic, and then expand it for visit-based segments. This is a two-step process:

  1. Identify the visitors that satisfy the segment criteria and extract their IDs. Adobe recommends using post_visid_high and post_visid_low to uniquely identify visitors. We will store the results in an intermediate table (segment table).
  2. Then we join that table with the original data set and return only the hits of those particular visitors.

For example, let's extract all the visitors that viewed a product detail page. We are re-using the temporary function "hit_segment" from earlier. We also need to specify a date range, since we are using time-partitioned data. This provides a degree of flexibility because we can select visitors that performed an action in, say, January, but view their activity over a different time period.

SELECT
   post_visid_high,
   post_visid_low
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-10-01' 
   AND date_time < '2018-10-30' 
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low

The result is a list of values where each row corresponds to an individual visitor that viewed a product page at some point in October 2018.

We save the result above in a secondary table named "visitor_segment" and then join the original data set with the "visitor_segment" table:

SELECT
   data.va_closer_id,
   count(distinct concat(cast(seg.post_visid_high as string), '-', cast(seg.post_visid_low as string))) as unique_visitors -- distinct count of visitors per channel
FROM
   `[dataset id].[table name]` as data
JOIN 
   visitor_segment as seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
   AND data.date_time > '2018-12-01' 
   AND data.date_time < '2019-01-31'
GROUP BY
   data.va_closer_id

The 2-step process we just followed calculates the number of unique visitors per marketing channel over the period Dec 2018 to Jan 2019, for visitors that had viewed a product page in Oct 2018.

In general, I find it preferable to save segment results in tables rather than use saved queries, temporary tables or nested queries, for the following reasons:

  1. Higher reusability/reproducibility of analyses. Once the outcome of a segment is considered valid, then it is easier to share and document the table rather than the underlying query. 
  2. Referencing the segment table in other queries makes the code much cleaner than introducing multiple nested statements.
  3. Temporary tables and views do not get materialised and you are getting billed every time you query them.

For the reasons mentioned above, the same logic (saving segments as tables) will be assumed for the rest of the post.
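For reference, one way to persist a segment directly from SQL is a CREATE TABLE ... AS SELECT statement – a sketch is below; setting a destination table in the BigQuery UI or via the bq tool works just as well:

CREATE TABLE `[dataset id].visitor_segment` AS
SELECT
   post_visid_high,
   post_visid_low
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-10-01'
   AND date_time < '2018-10-30'
   AND page_url LIKE '%/product%'   -- plain filter used here instead of the temporary UDF
GROUP BY
   post_visid_high,
   post_visid_low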

3. Visit segments

At the moment there is no single column that provides a unique visit ID. However, we can overcome this by utilising the visit_num column. It corresponds to the Visit Number dimension within Adobe reports: it starts at 1 and increments each time a visitor starts a new visit. So, in combination with post_visid_high/post_visid_low, it identifies a unique visit for each visitor.
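To make that concrete, a quick illustrative sketch of building a surrogate visit ID:

SELECT
   -- casts keep this robust whether the ID columns were loaded as strings or integers
   concat(cast(post_visid_high as string), '-', cast(post_visid_low as string), '-', cast(visit_num as string)) as visit_id,
   count(*) as hits
FROM
   `[dataset id].[table name]`
WHERE
   date_time > '2018-12-01'
   AND date_time < '2018-12-31'
GROUP BY
   visit_id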

We follow the same approach to create a table named "visit_segment", this time introducing the visit_num column. It will contain all visits that have at least one hit on a product page during December 2018:

SELECT
   post_visid_high,
   post_visid_low,
   visit_num
FROM
   `[dataset id].[table name]`  
WHERE
   date_time > '2018-12-01' 
   AND date_time < '2018-12-31'  
   AND hit_segment(page_url,'contains','/product')
GROUP BY
   post_visid_high,
   post_visid_low,
   visit_num

Now we can return all the visits that satisfy the segment requirements. Again we take the same query as before, but this time we add visit_num as an additional join condition.

SELECT
   data.va_closer_id as marketing_channel, 
   count(distinct concat(cast(seg.post_visid_high as string), '-', cast(seg.post_visid_low as string))) as unique_visitors -- distinct count of visitors per channel
FROM
   `[dataset id].[table name]` as data
JOIN 
   visit_segment as seg
ON 
   data.post_visid_high = seg.post_visid_high 
   AND data.post_visid_low = seg.post_visid_low 
   AND data.visit_num = seg.visit_num
WHERE data.date_time > '2018-12-01' 
   AND data.date_time < '2018-12-31'
GROUP BY
   marketing_channel

Note: the date ranges now need to match between the two queries, otherwise you will get incomplete or no results! This is because we are working with a visit segment (using visit_num as a join condition), and the hits of a given visit only exist within that visit's own time period.

At the end you will get a view similar to the one below. The marketing channels are represented as numbers, and each number depends on the report suite's configuration. The lookup for these IDs can be found in the Marketing Channel Manager section.

Summary

Going through the data and getting familiar with it is much easier than it looks. One of the best exercises I would suggest is to try to replicate basic reports through BigQuery and match them against the same reports within Adobe Analytics (i.e. daily page views or daily exit links). This way you will also familiarise yourself with other data peculiarities, for example which hits should be excluded from any calculations (hint).

I hope this gives you an idea of how easy it is to start analysing raw clickstream data.