Note: the code used to generate the figures in this post can be found here in the “AB-testing-small-samples” repository on my GitHub page.
In this post we look at the problem of A/B testing with small sample sizes. This is a tricky situation for several reasons. First, the statistical test that is commonly used to analyze A/B tests makes approximations that are not appropriate for small sample sizes. As a result of these approximations, the usual test will not reliably control the type 1 error (false positive) rate when the samples are small. The second issue has to do with the power of the test to detect a real difference between the two groups (here we are talking about statistical power). No matter what statistical test we use, the sample size might just be too small for us to reliably detect a real difference between the two groups if that difference is not very large. We’ll explore both of these issues below and make concrete recommendations on how to get the best results from an A/B test in this situation.
The setup of an A/B test
To start, let’s review the standard setup for an A/B test. In an A/B test we seek to compare the performance of two versions (versions A and B) of something. In the tech world this could be two versions of an ad that is being shown on a social media platform. In a medical context it could be two versions of a drug (or a placebo vs. a drug). To compare these two versions we take a sample of people and randomly divide them into two groups, groups A and B, and we give version A of the thing to the people in group A and version B to the people in group B. We then measure the number of “successes” in each group, where in the tech example a success would be having a user click on the ad they were shown, and in the medical example a success would be the successful treatment of a disease or symptom in a person. Finally, we compare the number of successes in group A to the number of successes in group B and use that information to reach a conclusion about whether one version is superior to the other.
Let \(n_A\) and \(n_B\) be the number of people in groups A and B. The outcome of the experiment is a measurement of the random variables \(X_A\) and \(X_B\), which are equal to the number of successes in groups A and B. We’ll use capital letters to denote these random variables, and we’ll use the lowercase letters \(x_A\) and \(x_B\) to denote the values of these variables that we observe when we do the A/B test. From these variables we can also define the success rates \(R_A\) and \(R_B\) for each group via \(R_A = X_A / n_A\) and likewise for group B. This is all the information we have available to decide whether one version performs better than the other version.
The statistical model that underlies the analysis of most A/B tests is the following. We assume that for version A the probability of getting a success with a randomly selected individual is \(p_A\). Similarly, with version B we assume the success probability is \(p_B\). With this setup, if the outcomes for any two people are independent of each other (a standard assumption), then the probability of \(x_A\) successes in a group of size \(n_A\) is given by the binomial distribution,
\[P(X_A = x_A) = {n_A \choose x_A}p_A^{x_A}(1-p_A)^{n_A - x_A}\ .\]
A similar expression holds for group B.
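For concreteness, this probability is easy to evaluate numerically. Here is a tiny Python sketch using scipy (the specific numbers are made up purely for illustration):

```python
from scipy.stats import binom

# Probability of observing x_A = 7 successes out of n_A = 20 people
# when the true success probability is p_A = 0.3 (illustrative values only).
print(binom.pmf(7, 20, 0.3))  # roughly 0.164
```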
Of course, the probabilities \(p_A\) and \(p_B\) are unknown to us. Therefore, we will need to use the measurements \(x_A\) and \(x_B\) (as well as the group sizes \(n_A\) and \(n_B\)) to make an inference about whether \(p_A\) is equal to \(p_B\) or not.
Throughout this post we’ll always be analyzing the A/B test results using a two-sided test. This is because we will be considering an A/B testing scenario where we do not have any expectation beforehand that one of the versions is better than the other version. We are open to the possibility that either version could be the better one, and simply want to test against the alternative hypothesis that \(p_A \neq p_B\). If in your application you do have an expectation (before conducting the experiment!) that one of the versions (say, version A) is better, then you can perform a one-sided test against the alternative hypothesis that \(p_A > p_B\), and you will gain some power to detect that particular alternative. However, if you decide to perform that one-sided test then you will not learn any information about the possibility that version B could be better.
The standard analysis for large samples
We’ll now briefly review the standard method for analyzing the test results when the samples are large.1 The null hypothesis is that there is no real difference between the two versions, i.e., that \(p_A = p_B = p\). Since we are doing a two-sided test our alternative hypothesis is that \(p_A\neq p_B\). Under the null hypothesis the following test statistic \(Z\) has a standard normal distribution in the limit that the sample sizes go to infinity,
\[Z = \frac{R_A-R_B}{\sqrt{p(1-p)\big(\frac{1}{n_A} + \frac{1}{n_B}\big)}}\ .\]
To test the null hypothesis at a significance level \(\alpha\) (where \(\alpha = 0.05\) for the usual 5% significance level), we reject the null hypothesis only when \(z\), the observed value of \(Z\) (obtained by replacing \(X_A\) and \(X_B\) in \(Z\) by their observed values \(x_A\) and \(x_B\)), satisfies \(|z| \geq z_{\alpha/2}\), where \(z_{\alpha/2} > 0\) satisfies
\[\alpha = 2(1 - \Phi(z_{\alpha/2}))\ ,\]
and where \(\Phi(x)\) is the cumulative distribution function for the standard normal distribution.2
We cannot actually perform this test because we don’t know the true value of \(p\). Instead, we replace \(p\) by an estimate \(R\) that uses the data from both groups (this is valid under the null hypothesis that the success probabilities are the same for both groups),
\[R = \frac{X_A + X_B}{n_A + n_B}\ .\]
The final result is that we now use a different test statistic \(T\), sometimes called the Score statistic or the Wald statistic with pooled variance, defined by
\[T = \frac{R_A-R_B}{\sqrt{R(1-R)\big(\frac{1}{n_A} + \frac{1}{n_B}\big)}}\ ,\]
and we reject the null hypothesis if \(|t| \geq z_{\alpha/2}\), where \(t\) is the observed value of \(T\) (again, this is obtained by replacing \(X_A\) and \(X_B\) in \(T\) by their observed values \(x_A\) and \(x_B\)).
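As a concrete illustration, here is a minimal Python sketch of this large-sample test (the function name pooled_z_test is mine; in practice you would just use the library routines mentioned in footnote 1, which should give essentially the same answer up to details like continuity corrections):

```python
import numpy as np
from scipy.stats import norm

def pooled_z_test(x_a, n_a, x_b, n_b, alpha=0.05):
    """Two-sided large-sample test of p_A = p_B using the pooled statistic T."""
    r_a, r_b = x_a / n_a, x_b / n_b
    r = (x_a + x_b) / (n_a + n_b)                  # pooled estimate R
    denom = np.sqrt(r * (1 - r) * (1 / n_a + 1 / n_b))
    t = (r_a - r_b) / denom if denom > 0 else 0.0  # degenerate samples carry no evidence
    z_crit = norm.ppf(1 - alpha / 2)               # z_{alpha/2}, about 1.96 for alpha = 0.05
    p_value = 2 * (1 - norm.cdf(abs(t)))
    return t, p_value, abs(t) >= z_crit            # reject the null when |t| >= z_{alpha/2}
```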
What goes wrong?
What goes wrong when the samples are small? The test will no longer control the type 1 error rate at the specified level \(\alpha\). In other words, our probability of rejecting the null hypothesis when it is true is no longer guaranteed to be equal to \(\alpha\). It could be larger or smaller than \(\alpha\) depending on the sample sizes and the true success probability, but the point is that we no longer have guaranteed control over the type 1 error rate.
In the figure below we show the results of a numerical simulation of the actual type 1 error rate for the approximate test for various values of the sample size \(n\) (we set \(n_A = n_B = n\) for simplicity) and \(p\), the common value of \(p_A\) and \(p_B\) under the null hypothesis. To generate each data point we repeated the following experiment 10000 times:
- Draw \(x_A\) and \(x_B\) independently from the binomial distribution of size \(n\) and success probability \(p\).
- Use the approximate test at significance level \(\alpha = 0.05\) to test the null hypothesis that the samples are drawn from the same distribution (we know they are).
We then divided the number of false positives (the number of times the approximate test gave a significant result) by 10000 to arrive at the data points shown in the figure. This experiment confirms that the approximate test does not control the type 1 error rate at the level \(\alpha\), and that it is possible for the actual type 1 error rate to be larger or smaller than \(\alpha\). Note, however, that we can already see the actual rate getting closer to \(\alpha\) as \(n\) gets larger.
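A simplified version of this simulation loop (not the exact script from the repository linked at the top, and reusing the pooled_z_test sketch from above) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def type1_error_rate(n, p, alpha=0.05, n_trials=10_000):
    """Estimate the actual type 1 error rate of the approximate test when the null is true."""
    false_positives = 0
    for _ in range(n_trials):
        x_a = rng.binomial(n, p)  # both groups use the same p, so the null hypothesis holds
        x_b = rng.binomial(n, p)
        _, _, reject = pooled_z_test(x_a, n, x_b, n, alpha)
        false_positives += reject
    return false_positives / n_trials
```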
This strange behavior happens because the approximate test relies on the large sample assumption in two places: (i) it assumes that \(Z\) follows a normal distribution, which is only strictly true as the sample sizes go to infinity, and (ii) it replaces the unknown true proportion \(p\) by the estimate \(R\), which could differ appreciably from \(p\) when the samples are too small for the law of large numbers to start taking over.
There is a simple rule of thumb3 that one can use to decide if a sample is too small to use the approximate test. Consider just group A for a moment. The estimator \(R_A\) has mean \(\mu = p_A\) and variance \(\sigma^2 = p_A(1-p_A)/n_A\). We might expect that the large sample approximation is not valid if the normal approximation to the distribution of \(R_A\) is not mostly contained within the interval [0,1]. The standard rule of thumb is to check that three standard deviations worth of the normal approximation lie in this interval, i.e., that \(\mu - 3\sigma \geq 0\) and \(\mu + 3\sigma \leq 1\). Of course, we don’t know the true values of \(\mu\) and \(\sigma\), so we again replace \(p_A\) with \(r_A = x_A/n_A\) in those expressions to arrive at these two conditions:
\[\begin{align} r_A - 3\sqrt{r_A(1-r_A)/n_A} &\geq 0 \\ r_A + 3\sqrt{r_A(1-r_A)/n_A} &\leq 1\ .\end{align}\]
The final rule of thumb for the A/B test is that one should only use the approximate large sample analysis if these conditions are satisfied for groups A and B.
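In code, that check might look like the following sketch (the function name large_sample_ok is mine):

```python
import numpy as np

def large_sample_ok(x, n):
    """Rule-of-thumb check that the normal approximation is reasonable for one group."""
    r = x / n
    sigma = np.sqrt(r * (1 - r) / n)
    # Note that r = 0 or r = 1 passes trivially (sigma = 0); such degenerate
    # samples deserve extra caution regardless of what this check says.
    return (r - 3 * sigma >= 0) and (r + 3 * sigma <= 1)

# Only use the approximate large-sample analysis when both groups pass, i.e. when
# large_sample_ok(x_a, n_a) and large_sample_ok(x_b, n_b) are both True.
```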
Barnard’s exact test
We’ll now look at an exact statistical test that can be used to control the type 1 error rate in A/B tests with small sample sizes. Our preferred test is Barnard’s exact test, although other options are available. A common alternative is Fisher’s exact test, but we prefer Barnard’s test for two reasons. First, Fisher’s test only considers the space of possible outcomes that have the same total number \(x_A + x_B\) of successes as the observed sample, but this restriction does not really make sense in our context since a repeat of the experiment could easily yield a different result with a different number of total successes.4 Second, Barnard’s test generally has more power than Fisher’s test to detect cases where \(p_A \neq p_B\). We’ll show this below using the results of a numerical calculation done in Python.
Barnard’s test proceeds as follows. Under the general setup of the A/B test, the probability of observing \(x_A\) successes in group A and \(x_B\) successes in group B is
\[\begin{align} P(x_A, x_B) &= {n_A \choose x_A}{n_B \choose x_B}p_A^{x_A}(1-p_A)^{n_A - x_A}p_B^{x_B}(1-p_B)^{n_B - x_B} \\ &= {n_A \choose x_A}{n_B \choose x_B}p^{x_A + x_B}(1-p)^{n_A + n_B - (x_A + x_B)}\ , \end{align}\]
where the second line holds under the null hypothesis that \(p_A = p_B = p\).
Recall the statistic \(T\) that we defined earlier in this post. The standard implementation of Barnard’s test5 uses a p-value that is equal to the sum of \(P(x'_A, x'_B)\) over all values \(x'_A, x'_B\) of the successes for each group that lead to a value \(t'\) of this statistic satisfying \(|t'| \geq |t|\), where \(t\) is the value of this statistic for the observed data \(x_A\) and \(x_B\). In other words, the p-value for Barnard’s test is the sum of the probabilities of all possible outcomes \(x'_A, x'_B\) that lead to a value of the test statistic \(T\) at least as extreme as (i.e., at least as far from zero as) the observed value \(t\).
This prescription does not quite lead to a usable expression for the p-value though, because we again do not know the true value of \(p\). In Barnard’s test this difficulty is overcome by taking a maximum of the aforementioned sum over all possible values of \(p\). Therefore, the p-value for Barnard’s test is
\[P_\text{Barnard} = \max_{p\in[0,1]} \sum_{\substack{x'_A,\ x'_B \\ |t'| \geq |t|}}{n_A \choose x'_A}{n_B \choose x'_B}p^{x'_A + x'_B}(1-p)^{n_A + n_B - (x'_A + x'_B)}\ .\]
If we reject the null hypothesis when \(P_\text{Barnard} \leq \alpha\), then Barnard’s test guarantees that our probability of committing a type 1 error will be less than or equal to \(\alpha\).
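To make this prescription concrete, here is a brute-force numerical sketch of \(P_\text{Barnard}\) in Python. It enumerates every possible outcome, flags the ones at least as extreme as the observed one, and maximizes the null probability of that region over a grid of values of \(p\). The scipy implementation in footnote 5 does something similar with a more careful optimization over \(p\), so treat this as an illustration rather than a production implementation:

```python
import numpy as np
from scipy.stats import binom

def barnard_pvalue(x_a, x_b, n_a, n_b, grid_size=1001):
    """Brute-force sketch of Barnard's p-value using the pooled statistic T."""
    XA, XB = np.meshgrid(np.arange(n_a + 1), np.arange(n_b + 1), indexing="ij")

    # Pooled statistic T for every possible outcome (x'_A, x'_B).
    RA, RB = XA / n_a, XB / n_b
    R = (XA + XB) / (n_a + n_b)
    denom = np.sqrt(R * (1 - R) * (1 / n_a + 1 / n_b))
    with np.errstate(divide="ignore", invalid="ignore"):
        T = np.where(denom > 0, (RA - RB) / denom, 0.0)

    # Outcomes at least as extreme as the observed one.
    extreme = np.abs(T) >= np.abs(T[x_a, x_b])

    # Maximize the null probability of the extreme region over p in [0, 1].
    p_grid = np.linspace(0.0, 1.0, grid_size)
    return max(
        (binom.pmf(XA, n_a, p) * binom.pmf(XB, n_b, p))[extreme].sum()
        for p in p_grid
    )
```

For example, barnard_pvalue(3, 9, 20, 20) should agree closely (up to the grid resolution) with scipy.stats.barnard_exact([[3, 9], [17, 11]]).pvalue.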
The power of Barnard’s test
We mentioned in the introduction that, when the sample sizes are small, we should expect to lose statistical power no matter what test we use. This means that our test may not be sensitive enough to reliably detect a true difference between the success probabilities \(p_A\) and \(p_B\). In this section we show the results of some numerical simulations that calculate the power of Barnard’s test in a few specific cases where \(p_A \neq p_B\). In particular, we look at two cases where \(p_B - p_A = 0.1\), which is small but not too small (it is actually quite large in the context of comparing advertisements on the web). We’ll also compare the power of Barnard’s exact test to that of Fisher’s exact test, and we’ll see that Barnard’s test generally has more power for small samples. Note that in these simulations we again set \(n_A\) and \(n_B\) equal to the same value \(n\).
To generate each data point in the figures below we repeated the following experiment 1000 times:
- Draw \(x_A\) from the binomial distribution of size \(n\) and success probability \(p_A\).
- Draw \(x_B\) from the binomial distribution of size \(n\) and success probability \(p_B\).
- Use Barnard’s and Fisher’s tests at significance level \(\alpha = 0.05\) to test the null hypothesis that the samples are drawn from the same distribution (we know they are not).
We then took the number of times that each test gave a significant result (i.e., the number of times that each test detected the difference between the two distributions) and divided it by 1000 to arrive at the data points for the power shown in the figures.
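A simplified version of this power simulation, using scipy’s barnard_exact (see footnote 5) and fisher_exact functions, might look like the following sketch (again, not the exact script from the repository linked at the top):

```python
import numpy as np
from scipy.stats import barnard_exact, fisher_exact

rng = np.random.default_rng(0)

def exact_test_power(n, p_a, p_b, alpha=0.05, n_trials=1_000):
    """Estimate the power of Barnard's and Fisher's exact tests by simulation."""
    rejections = {"barnard": 0, "fisher": 0}
    for _ in range(n_trials):
        x_a = rng.binomial(n, p_a)
        x_b = rng.binomial(n, p_b)
        # 2x2 table with the two groups as columns (successes on top, failures below).
        table = [[x_a, x_b], [n - x_a, n - x_b]]
        rejections["barnard"] += barnard_exact(table).pvalue <= alpha
        _, fisher_p = fisher_exact(table)  # two-sided by default
        rejections["fisher"] += fisher_p <= alpha
    return {name: count / n_trials for name, count in rejections.items()}
```

Keep in mind that barnard_exact can be noticeably slower than fisher_exact as \(n\) grows, although for the small samples considered here the loop is still manageable.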
We can see from these results that the exact tests simply do not have a lot of power to detect differences in \(p_A\) and \(p_B\) when the samples are small. Typically, one would prefer a value of the power greater than 0.5, and a value above 0.8 is considered ideal (that would mean that, when there is a true difference, the test detects it at least 80% of the time).
One possible option for increasing the power in this situation would be to raise the significance level \(\alpha\), for example from \(0.05\) to \(0.1\). This will increase the power, but at the expense of a higher chance of a false positive result.
A final alternative is the following. In cases where we expect \(p_A\) and \(p_B\) to differ by only a small amount, we can simply decide that it is not worth conducting a formal A/B test if we know we can only get a small number of participants. After all, we now know that such a test would be unlikely to detect a true difference if the difference is small. In this type of situation, you may be able to use the known lack of power of a small-sample A/B test to make a convincing case for additional resources or funding, so that you can run a test with a larger sample and more power to detect the difference you are looking for.
Footnotes
- This standard test is just the usual statistics 101 test for comparing two proportions. To perform this test in R use the prop.test function. In Python use the proportions_ztest function in the statsmodels package.
- This definition of \(z_{\alpha/2}\) is equivalent (in the limit where \(Z\) follows a standard normal distribution) to the requirement that \(P(|Z| \geq z_{\alpha/2}) = \alpha\).
- See here under the section titled “Normal approximation”.
- This restriction did make sense in the context of the Lady Tasting Tea experiment, which is the key example of an experiment that Fisher applied his exact test to.
- In Python this implementation can be found in the barnard_exact function from the scipy package. In R it can be found in the barnard.test function in the Barnard package.