This is the first post in a series on sequential hypothesis testing. Specifically, these posts will focus on the Sequential Probability Ratio Test (SPRT), one of the simplest and most well-known examples of a sequential test.
When conducting a standard statistical test, we first need to decide in advance on a sample size for the test. Then we need to wait until we have collected enough samples to reach our desired sample size. It is only then that we can finally perform our statistical analysis and calculate things like test statistics and p-values. Any deviation from this strict timing, like performing the statistical analysis before reaching our pre-specified sample size, will introduce uncontrolled errors into our results.
Sequential testing, on the other hand, allows us to perform our analysis continuously as new data arrives, and does not require us to decide on a sample size in advance. It allows all this while still rigorously controlling the error rates of the test (i.e., the risks of false positive or false negative outcomes). This flexibility makes sequential testing a very attractive alternative to ordinary statistical testing.
In fact, sequential testing offers one more huge advantage over ordinary statistical tests:
A sequential test can, on average, reach a decision using a much smaller number of samples than a standard statistical test with the exact same error rates.
In other words, by leveraging sequential testing, we can often reach a conclusion of the same quality (same risks of false positive or false negative outcomes) but using significantly fewer samples.
Because of this huge savings in the number of samples required to reach a conclusion, Abraham Wald’s original report on the SPRT was classified by the United States government during World War II.[1] It is easy to imagine how useful this methodology would have been for quality control of materials being manufactured for the war effort.
For modern data scientists, sequential testing is useful for all kinds of different problems. To take just one example, Netflix has recently written about how they use sequential testing to rapidly identify bugs or other issues in new software updates.[2]
In this post we’ll introduce the SPRT and discuss its key properties. After reading this post, you will have all the information you need to start using a very good approximate form of the SPRT. Then, in the next few posts, we’ll explore the SPRT in more detail, including example calculations and an in-depth look at some of the mathematics that the SPRT is based on.
Basic setup and notation
In the basic setup for sequential testing, we have a stream of data coming in, and we analyze the data as it comes in instead of waiting for a specific number of observations before we do our analysis. We’ll label the data by \(X_t\), for \(t=1,2,3,\dots\), where the index \(t\) here represents time (i.e., we first receive the data point \(X_1\), then \(X_2\), and so on).
We assume that the \(X_t\) are independent and identically distributed (i.i.d.) random variables, and our goal is to make an inference about the probability distribution \(f_{\theta}(x)\) that the \(X_t\) are drawn from. Here, \(\theta\) represents a parameter or set of parameters that define the distribution of the \(X_t\). For example, if the \(X_t\) were normally distributed, then we would have \(\theta = (\mu,\sigma)\) and
\[f_{\theta}(x) =\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\ ,\]
where \(\mu\) and \(\sigma^2\) are the mean and variance, respectively, of the normal distribution. The hypotheses that we’ll want to test will be statements about the values of the parameter(s) \(\theta\).
In the case where the \(X_t\) are continuous random variables, \(f_{\theta}(x)\) is the usual probability density function. When the \(X_t\) are discrete, \(f_{\theta}(x)\) instead represents the actual probability of the event that the random variable takes on the value \(x\). In other words, we have \(P(X_t \leq x) = \int_{-\infty}^x dy\ f_{\theta}(y)\) in the continuous case and \(P(X_t = x) = f_{\theta}(x)\) in the discrete case. We use the lowercase \(x_t\) to denote the observed value of the random variable \(X_t\), and
\[\vec{x}_t = (x_1,x_2,\dots,x_t)\]
to denote the vector of observed values of \(X_1,X_2,\dots,X_t\).
The likelihood function plays a major role in the theory of the SPRT. Recall that for \(t\) i.i.d. random variables, the likelihood function \(\mathcal{L}_{\theta}(\vec{x}_t)\) is defined to be the product of \(f_{\theta}(x_s)\) over all observations up to time \(t\),
\[\mathcal{L}_{\theta}(\vec{x}_t) = \prod_{s=1}^t f_{\theta}(x_s)\ .\]
This is just the joint probability distribution of \(X_1,X_2,\dots,X_t\), evaluated at the observed values \(\vec{x}_t\).
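To make this concrete, here is a minimal sketch of how \(\mathcal{L}_{\theta}(\vec{x}_t)\) can be computed for normally distributed data. The Python code below is our own illustration (not something from Wald’s paper), and it works in log space, since multiplying many densities together quickly underflows floating-point arithmetic:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, mu, sigma):
    """Log of the likelihood L_theta(x_vec) for i.i.d. normal data.

    Working in log space turns the product over observations into a sum,
    which avoids numerical underflow when t is large.
    """
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Example: the log-likelihood of 100 observations drawn from N(0.5, 1)
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=100)
print(log_likelihood(x, mu=0.5, sigma=1.0))
```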
With this notation in place, we are now ready to introduce the SPRT.
The Sequential Probability Ratio Test (SPRT)
The Sequential Probability Ratio Test (SPRT), invented by Wald,[3] is one of the simplest sequential tests. It applies to a scenario where under both the null and the alternative hypotheses the parameter(s) \(\theta\) take on specific fixed values (sometimes called “simple hypotheses”), which we’ll denote by \(\theta_0\) for the null hypothesis and \(\theta_1\) for the alternative hypothesis:
- Null Hypothesis \(H_0\): \(\theta = \theta_0\)
- Alternative Hypothesis \(H_1\): \(\theta = \theta_1\)
For the SPRT, we do not consider hypotheses where \(\theta\) is allowed to take on a range of different values.
In the case where \(f_{\theta}(x)\) is a normal distribution, an example of an allowed set of hypotheses is \(\mu = \mu_0\), \(\sigma = 1\) under the null hypothesis and \(\mu = \mu_1\), \(\sigma = 1\) under the alternative, i.e., two normal distributions with different means \(\mu_0\) and \(\mu_1\), but with the same variance \(\sigma^2 = 1\). An example set of hypotheses that would not be allowed is \(\mu = \mu_0\), \(\sigma = 1\) under the null and \(\mu > \mu_0\), \(\sigma = 1\) under the alternative. This is not allowed because \(\mu\) does not have a specific fixed value under the alternative hypothesis.
The main object that we analyze in the SPRT is the likelihood ratio \(\Lambda_t(\vec{x}_t)\) at each time \(t\), which is just the likelihood function under the alternative hypothesis divided by the likelihood function under the null hypothesis:
\[\Lambda_t(\vec{x}_t) = \frac{\mathcal{L}_{\theta_1}(\vec{x}_t)}{\mathcal{L}_{\theta_0}(\vec{x}_t)}\ .\]
Just like in the ordinary Likelihood Ratio Test, we expect \(\Lambda_t(\vec{x}_t)\) to tend toward values less than 1 when the null hypothesis is true and greater than 1 when the alternative is true.
Given the definition of \(\Lambda_t(\vec{x}_t)\), the SPRT then works as follows. We first choose positive constants \(A\) and \(B\) satisfying[4]
\[0 < A < 1 < B\ .\]
These constants will determine the type I and type II error rates of the test, as we discuss in the next section.
Next, at each time \(t\), we compare \(\Lambda_t(\vec{x}_t)\) to \(A\) and \(B\). The rules for how the test proceeds are as follows (a short code sketch of this decision rule appears after the list):
1. If \(\Lambda_t(\vec{x}_t) \leq A\), stop the test and accept the null hypothesis \(H_0\).
2. If \(\Lambda_t(\vec{x}_t) \geq B\), stop the test and accept the alternative hypothesis \(H_1\).
3. If \(A < \Lambda_t(\vec{x}_t) < B\), continue testing (i.e., collect the next observation \(x_{t+1}\) and repeat this analysis).
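To illustrate, here is a minimal sketch of this decision rule in Python for the normal-mean setup from above. The function name, the data stream, and the particular threshold values are our own choices for illustration (we discuss how to pick \(A\) and \(B\) in a principled way below); we track \(\log \Lambda_t\) rather than \(\Lambda_t\) itself for numerical stability:

```python
import numpy as np
from scipy.stats import norm

def sprt(observations, mu0, mu1, sigma, A, B):
    """Run the SPRT for H0: mu = mu0 vs. H1: mu = mu1, with known sigma.

    `observations` is any iterable that yields data points one at a time.
    Returns the accepted hypothesis and the stopping time T.
    """
    log_A, log_B = np.log(A), np.log(B)
    log_lambda = 0.0  # running value of log(Lambda_t)
    for t, x in enumerate(observations, start=1):
        # Each i.i.d. observation adds one term to the log likelihood ratio.
        log_lambda += norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
        if log_lambda <= log_A:
            return "H0", t  # rule 1: accept the null hypothesis
        if log_lambda >= log_B:
            return "H1", t  # rule 2: accept the alternative hypothesis
        # rule 3: otherwise, continue with the next observation
    raise RuntimeError("data stream ended before the SPRT stopped")

# Example run on data actually drawn from the alternative (mu = 0.5).
rng = np.random.default_rng(42)
stream = rng.normal(loc=0.5, scale=1.0, size=10_000)
print(sprt(stream, mu0=0.0, mu1=0.5, sigma=1.0, A=0.1, B=10.0))
```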
It is a non-trivial fact that the SPRT is guaranteed to stop in a finite amount of time.[5] In other words, the probability that \(A < \Lambda_t(\vec{x}_t) < B\) for all \(t\) is equal to zero. In an upcoming blog post, we’ll show how to prove this fact by following another paper by Wald.[6] For now though, we’ll assume that this is true and move on to a discussion of the key properties of this test.
Key properties of the SPRT
In this section we’ll give an overview of the key properties that give the SPRT a major advantage over standard statistical tests with fixed sample sizes. We briefly discussed these in the introduction, but we’ll go into the details here. We’ll also show a quick way to obtain approximate values of \(A\) and \(B\) so that you can quickly start using the SPRT without requiring complicated calculations to set up the test. To explain these properties, we’ll first need to review the definitions of type I and type II error rates, and then introduce the concept of the stopping time of the SPRT.
In any statistical test (sequential or fixed sample size) with null and alternative hypotheses, the type I error rate is the probability of rejecting \(H_0\) when it is true (false positive), and the type II error rate is the probability of rejecting \(H_1\) when it is true (false negative). The type I error rate is usually denoted by \(\alpha\), and the type II error rate is usually denoted by \(\beta\). Typical values for these that are often used in practice are \(\alpha = 0.05\) and \(\beta = 0.2\). This choice of \(\alpha\) leads to the usual 5% level for statistical significance, and this choice of \(\beta\) guarantees that the statistical power, or the probability of accepting \(H_1\) when it is true, is equal to \(1-\beta = 0.8\).
Next, the stopping time \(T\) of the SPRT is a random variable that is equal to the earliest time \(t\) such that \(\Lambda_t(\vec{x}_t) \leq A\) or \(\Lambda_t(\vec{x}_t) \geq B\). In other words, it is the first time where either condition 1 or 2 from the last section is satisfied and so the SPRT comes to a stop. \(T\) is a random variable because it depends on the other random variables \(X_t\) that we are measuring (and it is a stopping time, in the precise sense of the term, because whether or not \(T = t\) is determined entirely by the observations \(X_1,\dots,X_t\) up to time \(t\)).
The first key property of the SPRT is that it does not require a fixed sample size. This means that (1) you do not need to decide on a sample size before running the test, and (2) you do not need to wait for a certain amount of data to be collected before you start your analysis. Instead, the test runs continuously and you evaluate new data points as they arrive until the test comes to a stop (i.e., until condition 1 or 2 from above is satisfied).
In addition to the convenience of being able to run the test continuously, the SPRT also has one key quantitative advantage over standard statistical tests. To state this advantage we first note that, since the SPRT is based on the likelihood ratio, the standard statistical test that we should compare it to is the Likelihood Ratio Test. The Neyman-Pearson lemma tells us that, for a given type I error rate \(\alpha\) and sample size \(n\), the Likelihood Ratio Test is the most powerful test (i.e., it achieves the largest possible value of \(1-\beta\)) among all standard statistical tests that could be used to compare two hypotheses \(H_0\) and \(H_1\) of the form that we are considering here.
Knowing this, we can now state the key advantage of the SPRT in comparison to the Likelihood Ratio Test. Consider a Likelihood Ratio Test with type I error rate \(\alpha\), sample size \(n\), and statistical power \(1-\beta\). Then it is possible for the stopping time \(T\) of the SPRT with the same type I error rate \(\alpha\) and power \(1-\beta\) to satisfy
\[E[T] < n \tag{1}\ ,\]
where \(E[\cdots]\) denotes the expectation value of the quantity inside the square brackets. The inequality in Eq. (1) means that, on average, the SPRT can reach a decision sooner (i.e., requiring fewer samples) than the Likelihood Ratio Test with the same error rates \(\alpha\) and \(\beta\). In practice, \(E[T]\) can be significantly smaller than \(n\), as is demonstrated in the paper by Wald that we cited in Footnote 1 below.
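To get a feel for the size of this saving, here is a minimal simulation sketch (our own construction, not a calculation from Wald’s paper) for the normal-mean setup, using the approximate thresholds \(A = \beta/(1-\alpha)\) and \(B = (1-\beta)/\alpha\) that we introduce below. It estimates \(E[T]\) under the alternative and compares it to the approximate fixed sample size \(n\) from the standard power calculation for a one-sided normal-mean test with the same \(\alpha\) and \(\beta\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
mu0, mu1, sigma = 0.0, 0.5, 1.0
alpha, beta = 0.05, 0.2
A, B = beta / (1 - alpha), (1 - beta) / alpha  # Wald's approximate thresholds
log_A, log_B = np.log(A), np.log(B)

def stopping_time(true_mu):
    """Run one SPRT on data drawn from N(true_mu, sigma); return T."""
    log_lambda, t = 0.0, 0
    while log_A < log_lambda < log_B:
        t += 1
        x = rng.normal(true_mu, sigma)
        log_lambda += norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
    return t

# Monte Carlo estimate of E[T] under the alternative hypothesis.
print("estimated E[T]:", np.mean([stopping_time(mu1) for _ in range(2000)]))

# Fixed sample size needed by the one-sided fixed-n test with the
# same error rates (the standard power calculation for a normal mean).
n = ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) * sigma / (mu1 - mu0)) ** 2
print("fixed-sample n:", int(np.ceil(n)))
```

Under these settings, the estimated \(E[T]\) typically comes out well below the fixed-sample \(n\), consistent with Eq. (1).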
It is a remarkable fact that, by adopting a sequential testing methodology, we can significantly reduce the number of samples required to reach a conclusion without increasing the chances of making an error. As we mentioned in the introduction, this discovery was so important that Wald’s theory of sequential testing was classified by the US government during World War II.
The last property that we’ll look at in this section is a set of inequalities that the constants \(A\) and \(B\) must satisfy in an SPRT with error rates \(\alpha\) and \(\beta\). These bounds are useful for obtaining approximate values of \(A\) and \(B\) that will allow you to quickly start using the SPRT. For readers who want to know more about these bounds, we explain how to derive them in the last section of this post.
In an SPRT with error rates \(\alpha\) and \(\beta\), \(A\) and \(B\) satisfy the inequalities
\[\frac{\beta}{1-\alpha} \leq A \tag{2}\]
and
\[B \leq \frac{1-\beta}{\alpha}\tag{3}\ , \]
so that for the SPRT we have
\[\frac{\beta}{1-\alpha} \leq A < 1 < B \leq \frac{1-\beta}{\alpha}\ .\]
Note that this relation implies that we need to choose \(\beta < 1 - \alpha\) for the test to be consistent.
It is also interesting to note from here that \(\alpha\) and \(\beta\) must satisfy
\[\beta \leq A\ ,\]
and
\[\alpha \leq \frac{1}{B}\ .\]
These follow from Eqs. (2) and (3) because \(1-\alpha\) and \(1-\beta\) are always less than or equal to one. So these bounds also place constraints on how large \(\alpha\) and \(\beta\) can be for any choice of \(A\) and \(B\). For example, by choosing \(A = 0.2\) and \(B = 20\), we can rigorously guarantee that \(\alpha \leq 0.05\) and \(1-\beta \geq 0.8\).
In practice, we can often choose \(A\) and \(B\) to be equal to the values that lie at the edges of the regions defined by these inequalities, i.e., we can take \(A = \beta/(1-\alpha)\) and \(B = (1-\beta)/\alpha\), and this is usually accurate enough to start using the SPRT. In other words, with this choice of \(A\) and \(B\) we will have \(\alpha_{\text{true}}(A,B) \approx \alpha\) and \(\beta_{\text{true}}(A,B) \approx \beta\), where \(\alpha_{\text{true}}(A,B)\) and \(\beta_{\text{true}}(A,B)\) are the true values of the type I and type II error rates in the SPRT with these values of \(A\) and \(B\). We will test how good this approximation is in an upcoming blog post, where we will show all the details of using the SPRT in an example.
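For example, with \(\alpha = 0.05\) and \(\beta = 0.2\), this prescription gives \(A = 0.2/0.95 \approx 0.211\) and \(B = 0.8/0.05 = 16\).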
If your application does require exact values of \(\alpha\) and \(\beta\), then it is also possible to determine the values of \(A\) and \(B\) that provide those exact values, but that calculation is harder to do and we won’t go into it here. One possible approach is similar to the Markov chain approach that we used for CUSUM in our previous blog post.
Summary — how to start using the SPRT right now
Using the material we’ve gone over so far, you can immediately start using a very good approximate form of the SPRT by following these simple steps (a worked end-to-end example follows the list):
- Decide on your null and alternative hypotheses \(H_0\) and \(H_1\), and the form of the distribution function \(f_{\theta}(x)\) under both hypotheses.
- Decide on the type I and type II error rates for your test. A typical choice is \(\alpha = 0.05\) and \(\beta = 0.2\).
- Choose \(A = \beta/(1-\alpha)\) and \(B = (1-\beta)/\alpha\).
- Run the SPRT by following steps 1-3 that we outlined earlier in this post.
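As a concrete end-to-end illustration, here is a minimal sketch of these four steps in Python for a stream of Bernoulli data, say conversion events with rate \(p_0 = 0.10\) under the null and \(p_1 = 0.15\) under the alternative. The scenario and all names here are our own choices for the example:

```python
import numpy as np
from scipy.stats import bernoulli

# Step 1: hypotheses. H0: p = 0.10 vs. H1: p = 0.15 for Bernoulli data.
p0, p1 = 0.10, 0.15

# Step 2: type I and type II error rates.
alpha, beta = 0.05, 0.2

# Step 3: approximate thresholds.
A, B = beta / (1 - alpha), (1 - beta) / alpha
log_A, log_B = np.log(A), np.log(B)

# Step 4: run the SPRT on the incoming stream (simulated from H1 here).
rng = np.random.default_rng(1)
log_lambda, t = 0.0, 0
while log_A < log_lambda < log_B:
    t += 1
    x = rng.binomial(1, p1)  # one new observation arrives
    log_lambda += bernoulli.logpmf(x, p1) - bernoulli.logpmf(x, p0)

print("accepted", "H1" if log_lambda >= log_B else "H0", "at t =", t)
```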
Stay tuned for the next posts in this series, where we’ll look at examples and then show how to prove some of the important properties of the SPRT.
Appendix: Deriving the bounds on \(A\) and \(B\)
In this final section we will derive the bounds on \(A\) and \(B\) from Eqs. (2) and (3) above. The derivation will rely on the fact that the SPRT is guaranteed to stop in finite time. As we mentioned above, we’ll come back and prove this fact in an upcoming blog post.
As above, let \(T\) be the stopping time of the SPRT. The fact that the SPRT is guaranteed to stop in finite time means that we have
\[P(T = \infty) = 0\ , \tag{4}\]
which implies that
\[\sum_{t=1}^{\infty}P(T = t) = 1 \tag{5}\ .\]
Here, \(P(\cdot)\) denotes the probability of the event in the parentheses. Eq. (5) may seem like it should be obvious, but if the SPRT were not guaranteed to stop after a finite amount of time then we would have \(\sum_{t=1}^{\infty}P(T = t) < 1\) and \(P(T = \infty) > 0\). It is also important to note that Eqs. (4) and (5) hold under both the null and alternative hypotheses.
We’ll now derive the bound on \(A\). The derivation of the bound on \(B\) is similar, so we won’t show all the details for that one. To start, note that \(\beta\) is the probability of rejecting \(H_1\) when it is true. If we use \(P_1(\cdots)\) to denote the probability of an event under the alternative hypothesis, then we have
\[\beta = P_1(\Lambda_T \leq A)\ ,\]
where \(\Lambda_T\) is the likelihood ratio at the stopping time \(T\). This equation says that \(\beta\) is equal to the probability (under the alternative hypothesis) that, at the stopping time \(T\), the likelihood ratio \(\Lambda_T\) is less than or equal to \(A\), leading us to reject the alternative hypothesis. Note also that this equation only makes sense because we know that \(T\) is finite.
Since we know that the SPRT stops in finite time (as expressed in Eqs. (4) and (5) above), we can use the law of total probability to express this as
\[\beta = \sum_{t=1}^{\infty}P_1(\Lambda_T \leq A,T=t)\ ,\]
where, for any events \(E_1\) and \(E_2\), \(P(E_1,E_2)\) is equal to the probability that the two events \(E_1\) and \(E_2\) both occur (i.e., the probability of \(E_1\) and \(E_2\)).
The next step in the proof is to bound the joint probability \(P_1(\Lambda_T \leq A,T=t)\). To do this, we’ll use a concrete expression for this probability as an integral over the vector \(\vec{x}_t = (x_1,\dots,x_t)\) of observed values of the first \(t\) random variables.
To obtain a concrete expression for \(P_1(\Lambda_T \leq A,T=t)\), we simply need to remember that, in order for both of the events \(\Lambda_T \leq A\) and \(T=t\) to happen, the likelihood ratios \(\Lambda_1,\dots,\Lambda_t\) must satisfy all of the following conditions:
\[\Lambda_s \in (A,B)\ \forall s \in\{1,2,\dots,t-1\}\ ,\]
and
\[\Lambda_t \leq A\ .\]
In other words, the likelihood ratios \(\Lambda_1,\Lambda_2,\dots,\Lambda_{t-1}\) must all take values in the open interval \((A,B)\), and \(\Lambda_t\) must be less than or equal to \(A\).
If we define \(E_s\) to be the event that \(\Lambda_s \in (A,B)\) and \(E_s'\) to be the event that \(\Lambda_s \leq A\), then we have
\[P_1(\Lambda_T \leq A,T=t) = P_1(E_1,\dots,E_{t-1},E_t')\ , \tag{6}\]
where, for any events \(E_1,E_2,\dots,E_n\), the expression \(P(E_1,E_2,\dots,E_n)\) denotes the probability of \(E_1\) and \(E_2\) and… and \(E_n\), which is similar to the notation that we used above for the case of two events.
The probability on the right-hand side of Eq. (6) can be written as an integral over \(\vec{x}_t\) as
\[P_1(E_1,\dots,E_{t-1},E_t') = \int d\vec{x}_t\ \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \mathcal{L}_{\theta_1}(\vec{x}_t)\ ,\]
where \(d\vec{x}_t = \prod_{s=1}^t dx_s\) and \(\mathbb{I}[E_1,\dots,E_{t-1},E_t']\) is the indicator function that equals one if all of the events \(E_1,\dots,E_{t-1},E_t'\) are true and zero otherwise. Finally, the likelihood function \(\mathcal{L}_{\theta_1}(\vec{x}_t)\) with \(\theta = \theta_1\) appears here because we are calculating a probability under the alternative hypothesis.
The last step in deriving the bound is to note that, since the indicator function enforces the condition that \(\Lambda_t \leq A\), we have
\[\begin{align} \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \mathcal{L}_{\theta_1}(\vec{x}_t) &= \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \Lambda_t(\vec{x}_t) \mathcal{L}_{\theta_0}(\vec{x}_t) \\ &\leq A\ \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \mathcal{L}_{\theta_0}(\vec{x}_t)\ . \end{align}\]
Then we have
\[P_1(\Lambda_T \leq A,T=t) \leq A \int d\vec{x}_t\ \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \mathcal{L}_{\theta_0}(\vec{x}_t)\ ,\]
and the integral on the right-hand side is now the probability of \(\Lambda_T \leq A\) and \(T=t\) occurring under the null hypothesis,
\[\int d\vec{x}_t\ \mathbb{I}[E_1,\dots,E_{t-1},E_t']\ \mathcal{L}_{\theta_0}(\vec{x}_t) = P_0(\Lambda_T \leq A,T=t) \ .\]
At this point we have established that
\[P_1(\Lambda_T \leq A,T=t) \leq A\ P_0(\Lambda_T \leq A,T=t)\ .\]
Finally, summing both sides over all values of \(t\) gives \(P_1(\Lambda_T \leq A) \leq A\ P_0(\Lambda_T \leq A)\), and then we know that \(P_0(\Lambda_T \leq A) = 1 - \alpha\) since it is the probability of accepting the null hypothesis when it is true. This yields the final bound
\[\beta \leq A(1-\alpha)\ ,\]
which matches Eq. (2) from above.
The derivation of the bound involving \(B\) is similar, so we won’t go through all the details here, but we will mention the starting point. In this case we use the fact that \(1-\beta\) is the probability of accepting the alternative hypothesis when it is true (i.e., the statistical power of the test), and so we have
\[1 - \beta = P_1(\Lambda_T \geq B)\ .\]
Then, using similar manipulations as above, we can derive the bound \(1-\beta \geq B\alpha\), which matches Eq. (3) from above.
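For completeness, the condensed version of that argument is
\[1 - \beta = \sum_{t=1}^{\infty}P_1(\Lambda_T \geq B, T=t) \geq B\sum_{t=1}^{\infty}P_0(\Lambda_T \geq B, T=t) = B\ P_0(\Lambda_T \geq B) = B\alpha\ ,\]
where the inequality follows because the indicator function now enforces \(\Lambda_t \geq B\), so that \(\mathcal{L}_{\theta_1}(\vec{x}_t) = \Lambda_t(\vec{x}_t)\,\mathcal{L}_{\theta_0}(\vec{x}_t) \geq B\,\mathcal{L}_{\theta_0}(\vec{x}_t)\) inside the integral, and the last equality holds because \(P_0(\Lambda_T \geq B) = \alpha\) is the probability of accepting the alternative hypothesis when the null is true (a type I error).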
Footnotes
1. Wald, A. (1945). Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics, 16(2), 117–186. http://www.jstor.org/stable/2235829.
2. Blog post: https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df. Technical paper: https://arxiv.org/abs/2205.14762.
3. The paper by Wald that we cited in Footnote 1 is an excellent reference for the material in the rest of this post.
4. For readers who want to compare to Wald’s paper from Footnote 1, our \(A\) is his \(B\) and vice versa.
5. More precisely, it is guaranteed to stop in a finite amount of time under mild assumptions on the distribution \(f_{\theta}(x)\). See the reference in the next footnote.
6. Wald, A. (1944). On Cumulative Sums of Random Variables. The Annals of Mathematical Statistics, 15(3), 283–296. http://www.jstor.org/stable/2236250.