The purpose of AB testing in the digital world is to run a controlled trial of a hypothesis and make the most informed decision possible. To assess AB testing results we rely on calculating their statistical significance through the p-value. This article walks through how those results are calculated, to build a better understanding of the underlying process.
Today a lot of statistical calculators exist online. A few common examples are:
They provide an easy-to-use way to quickly understand key information like the statistical significance of your tests, volume of traffic required and volume of conversions.
Regardless of how easy they are to use, a savvy analyst should understand and appreciate the underlying methodology used to generate the results. This post focuses on how we estimate the statistical significance of our tests starting from the original data (i.e. conversions).
Z-Score & Z-test
In probability theory, the normal distribution is the distribution of a continuous random variable that follows the informally named bell curve. The mean (μ) and variance (σ²) of the population are known. Depending on their values, the centre and spread of the bell change.
To understand whether the mean of a sample is significantly different from the population mean (μ), we need to perform a Z-test. Here we are interested in a two-tailed test, formulated as:
- H0: m = m0 – Null hypothesis – The mean of our sample (m or X-bar) is not different from the value m0.
- Ha: m ≠ m0 – Alternative hypothesis – The mean of our sample (m) is different from the value m0.
The Z-score is calculated with the formula below:

Z = (X-bar – μ) / σ

where:
- X-bar: sample mean
- μ: population mean
- σ: population standard deviation
The Z-score standardises our variable so the comparison happens on the same scale. We can then calculate the probability of a value occurring. A value has a 95% probability of falling within almost 2 (1.96 to be precise) standard deviations of the mean.
Note: In general if we did not know the population mean and standard deviation then we would need to perform a t-test.
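The two-tailed Z-test described above can be sketched in a few lines of code. This is a minimal illustration, not a full testing library; the sample values at the bottom are made-up numbers for demonstration:

```python
from statistics import NormalDist


def z_test_two_tailed(sample_mean, pop_mean, pop_std, n):
    """Two-tailed Z-test of a sample mean against a known population mean.

    Standardises the sample mean using the standard error (sigma / sqrt(n))
    and returns the z-score plus the two-tailed p-value.
    """
    standard_error = pop_std / n ** 0.5
    z = (sample_mean - pop_mean) / standard_error
    # Two tails: double the probability in the upper tail beyond |z|
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative numbers: a sample of 100 observations with mean 10.4,
# drawn from a population with mean 10 and standard deviation 2
z, p = z_test_two_tailed(10.4, 10.0, 2.0, 100)  # z = 2.0, p ≈ 0.0455
```

With p ≈ 0.0455 < 0.05, this illustrative sample mean would be considered significantly different from the population mean at the 95% confidence level.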
Understanding our AB test data
The next step is to gain better insight into our data. In its most granular view, our data contains an individual entry for each user (or session), along with their conversion outcome and the group they belong to. In this example, Outcome is the variable of interest and Group splits the population between control and variant.
Usually in a large-scale test we will have tens or hundreds of thousands of users. This works in our favour later, as we will need large volumes of data to satisfy some key assumptions.
If you plot this data you will notice that it looks far from normally distributed, which is a requirement for the Z-test. In fact, our “conversion” variable follows the hypergeometric distribution. Because our sample size is big and we perform only one (n = 1) attempt per sample (i.e. each customer gets a single “attempt” to convert), a binomial approximation is appropriate.
For the same reason (n = 1), our data falls under a special case of the binomial called the Bernoulli distribution.
Given the above, our random variable has variance σ² equal to p * (1 - p), where p is the probability of success of each trial: our conversion rate.
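Since each user either converts (1) or does not (0), the mean and the Bernoulli variance can be computed directly from the raw outcomes. A quick check, using a made-up outcome list for illustration:

```python
# Each entry is one user's outcome: 1 = converted, 0 = did not convert.
# Illustrative data: 3 conversions out of 20 users.
outcomes = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

n = len(outcomes)
p = sum(outcomes) / n      # conversion rate = mean of a Bernoulli variable
variance = p * (1 - p)     # Bernoulli variance

# The population variance of the raw 0/1 data matches p * (1 - p) exactly
empirical_variance = sum((x - p) ** 2 for x in outcomes) / n
```

Here p = 0.15 and both variance calculations give 0.1275, confirming that for 0/1 data the variance really is just p * (1 - p).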
But how does this help us calculate the statistical significance in AB testing? Is a 1% difference in conversion rate significant?
Central Limit Theorem (CLT)
To proceed, we need first to understand the central limit theorem and how it helps us assess our data.
In probability theory, the central limit theorem (CLT) establishes that when independent random variables are added, their properly normalised sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions. (Source: https://en.wikipedia.org/wiki/Central_limit_theorem)
The key idea is that if we take any random variable and draw many samples from it, the means of those samples approximately follow a normal distribution.
The above is helpful when we try to apply the Z-test on data which originally is not normal. To do so we need to first calculate the properties of the new sampling distribution of the mean.
- The mean of the sampling distribution will be equal to the population mean μ (for a large number of samples).
- The variance of the sampling distribution will be equal to the population variance divided by the sample size: σ²/n.
So the sampling distribution of the mean of any random variable is:

X-bar ~ N(μ, σ²/n)
This formula will come in handy below, where we will need to combine our two variables. Remember: we have two random variables, the conversions of the control group and of the variant group.
Combining our testing groups
The discussion on Z-score and Central Limit Theorem has covered a single random variable. In our original dataset we have two random variables; control and variant conversions. We need to combine them in order to get a single Z-score.
Based on mathematical properties, when we sum two independent normally distributed random variables, the result is another normally distributed random variable (Wiki – Sum of Independent Normal Random Variables). This property applies to both summation and subtraction. The general case is:

X ± Y ~ N(μx ± μy, σx² + σy²)
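Note that the variances add even when we subtract the variables. A small simulation illustrates this; the distribution parameters below are arbitrary:

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed for reproducibility

# Two independent normal variables: X ~ N(5, 2^2) and Y ~ N(3, 1.5^2)
xs = [random.gauss(5, 2) for _ in range(200_000)]
ys = [random.gauss(3, 1.5) for _ in range(200_000)]

diffs = [x - y for x, y in zip(xs, ys)]

mean_diff = mean(diffs)      # close to 5 - 3 = 2
var_diff = pvariance(diffs)  # close to 2^2 + 1.5^2 = 6.25 (variances ADD)
```

This is why, when we build the combined control-versus-variant variable below, its variance is the sum of the two groups' variances.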
Joining it all together
The next step is to combine our two test groups into one variable that is normal, so that a Z-test can be performed. The new variable will be denoted as Sc-v, where “c” stands for control and “v” for variant. It will have a mean equal to the difference of the sampling means of the control and variant groups, and a variance equal to the sum of the variances of the sampling means.
Usually the Z-score requires a control value (μ0) against which we perform the test. But because our null hypothesis states that the control and variant means are equal (difference equal to 0), we are testing against 0. So our Z-score calculation is:

Z = (μv – μc) / sqrt(σc²/Nc + σv²/Nv)
Keep in mind that in this formula, the mean, variance and population values correspond to the original data from the AB test. For example:
- The mean of variant (μv) is equal to the conversion rate (i.e. 3%)
- The variance of the variant σv² is μv * (1 - μv). This is due to the properties of the Bernoulli distribution.
- The population Nv is equal to the total number of visitors in the variant.
Let’s calculate the score as a practice, assuming we have the data below:

| Group | Population N | Conversions | Mean μ | Variance σ² |
|---|---|---|---|---|
| Control | 80,000 | 1,600 | 0.0200 | 0.0196 |
| Variant | 80,000 | 1,752 | 0.0219 | 0.0214 |
The sample data above give us a Z-score of 2.6534. But is this good or bad?
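The calculation can be reproduced in a few lines. The population sizes and conversion counts below are illustrative assumptions chosen to be consistent with the 2% and 2.19% conversion rates used in this example:

```python
# Illustrative example: two equally sized groups with conversion rates
# of 2% (control) and 2.19% (variant). Sizes are assumed for the demo.
n_c, conversions_c = 80_000, 1_600   # control group
n_v, conversions_v = 80_000, 1_752   # variant group

p_c = conversions_c / n_c            # 0.02
p_v = conversions_v / n_v            # 0.0219

# Bernoulli variance of each group: p * (1 - p)
var_c = p_c * (1 - p_c)
var_v = p_v * (1 - p_v)

# Z-score of the difference of the two sampling means,
# tested against a null difference of 0
z = (p_v - p_c) / (var_c / n_c + var_v / n_v) ** 0.5   # ≈ 2.6534
```

Note how the denominator is the standard deviation of the combined variable: each group's Bernoulli variance divided by its own population size, then summed.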
Calculating statistical significance in AB Testing
We mentioned that the Z-score provides the distance from the mean using the standard deviation as the measurement unit. Furthermore, 95% of the values fall within the [-1.96, +1.96] range. This means that a Z-score of 2.6534 falls outside that range, so it is quite rare to occur given our confidence level of 95%. Let’s see the exact probability.
Because we are performing a two-tailed test, we only test for non-equality. We do not care about which is bigger, so we need the probability for both the negative and the positive Z-score; that is, we add the area under both tails of the distribution. The cumulative distribution function is used to calculate the area under the curve.
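Turning the Z-score into a two-tailed p-value is a one-liner with the standard normal CDF (using Python's stdlib `statistics.NormalDist` here, though any statistical library offers the same function):

```python
from statistics import NormalDist

z = 2.6534  # the Z-score calculated from our example data

# Two-tailed p-value: probability of a value at least this far
# from 0 in either direction, i.e. both tails of the bell curve
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # ≈ 0.008
```

Taking `abs(z)` and doubling the upper-tail probability covers both tails symmetrically, which is exactly the area we described above.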
Our two groups had conversion rates of 2% and 2.19% – a difference of 0.19 percentage points, or an uplift of 9.5%. For our population size and conversion levels, there is only a 0.8% probability that this difference was due to random chance. Given that our significance level is – usually – set to 5%, we reject the null hypothesis that the two groups have equal conversion rates.
We just went through the process of how exactly the p-value is calculated in a typical AB testing scenario. Calculating the statistical significance of AB testing activities always helps us assess the results of our efforts.
Hopefully this provides a good insight into the mathematical steps required. It is worth mentioning that in AB testing there are other key aspects to keep in mind, such as the power of the test and the significance level. Power is the probability of rejecting the null hypothesis when it is indeed false, i.e. of detecting a real effect. The significance level (or α) is the threshold we set for the probability that the detected difference (i.e. uplift in conversion) is due to chance. Usually this is set to 5% or 10%.
However, I preferred to keep the focus on calculating the p-value from the original data, to keep the post as specific as possible. Let me know your thoughts and feedback!