Clustering in BigQuery
BigQuery is well-known for being lighting fast in terms of processing enormous volumes of data. There is no doubt about this. We, as product users, have a few options in our disposal to achieve so. First and foremost is partitioning tables by date (Working With Partitioned Tables). The concept of a partition tabled is pretty much straight forward; allows to split the data based on a timestamp or a date column.
While I was researching on YouTube, discovered another related and useful feature: Clustered Tables (no idea how I missed that!). The main idea is that a column (i.e. a categorical column that notionally groups your data) can be used to logically organise and co-locate the data. When this column is used in a WHERE statement, it makes the performance of the query much better.
In short, given an imaginary table below, the left table is unorganised (not partitioned). By applying partitioning (Date) and clustering (Tags) the data is grouped logically. This in turn will make your queries much faster when you introduce those columns in the WHERE statement.
This video (even though a bit annoying) is informative about the purpose of the feature.
Clustering and Adobe Analytics
All this sounds nice but this is not a new feature in BigQuery. However it makes it very useful when you try to work with clickstream data from Adobe Data Feeds (or any other raw behavioural data).
How many times did you have to segment on one of the following criteria:
- Domain name
- Page type and/or page category
- User login status
- Language and/or country
- Marketing channel
The list goes on. Most of the time, information like above ends up in a dedicated variable (either traffic or conversion). Those cases are the perfect candidate for a dimension to apply clustering on. With the amount of data generated from a high volume site, optimising your queries is only logical. It will not only save you time but also money as well since clustering limits the amount of data processed.
In other cases, we are used to grouping data in a single variable while using processing rules to classify the data. Most typical example is the page name; having a variable format that concatenate multiple values. While this approach allows to limit the usage of multiple variables does not allow to apply clustering for a portion of the value (i.e. WHERE page_type = ‘home page’). Since the classifications are not available in Data Feeds, some extra effort is required to split the values in multiple variables. Instead of:
- prop1 = [site name]:[page type]:[page category]
it is preferable to split the values in multiple variables:
- prop1 = [site name]
- prop2 = [page type]
- prop3 = [page category]
Using this simpler structure (which initially looks like you unnecessary use a lot of variables), will allow you to take advantage of clustering. A few caveats to be aware though (link):
- Clustering can be applied only during the creation of the table. So if you have a huge table you would need to re-create the table and reload all data. Also we cannot modify the clustering columns after the creation of the table.
- Clustering columns must be top-level, non-repeated columns of one of the following types:
GEOGRAPHY. This limits the options but only a little bit. With the appropriate planning on data collection,
STRINGtype would be more than enough.
- Working with the products or events variable becomes tricky in their raw format and I am not going to discuss it now!
Overall, it is up to you what dimensions you will use to organise your data. The key point is that by properly thinking through your dimensions’ data, you can achieve efficiencies both in terms of cost and processing time. So there is no excuse for not doing so!
Best practises on querying clustered tables https://cloud.google.com/bigquery/docs/querying-clustered-tables
For other posts related to Adobe Analytics Data Feeds, check https://analyticsmayhem.com/tag/datafeeds/