How to prove that the stopping time is finite
In the first post in this series we introduced the concept of sequential hypothesis testing and then gave an introduction to the simplest sequential test, the Sequential Probability Ratio Test (SPRT). That post was intended to provide the basic information that you would need to start applying the SPRT, but it did not provide the mathematical details that explain why the SPRT works the way it does. In this post and the next we are going to provide those mathematical details. Then, in the final post of the series (the series will contain four posts in total), we’ll put everything together by first applying the SPRT to an example, and then using the mathematical properties of the SPRT to explain exactly what is going on in that example.
Review of the basic setup
Recall that our basic setup is a stream of independent and identically distributed (i.i.d.) random variables \(X_1,X_2,X_3,\dots\), which we denote by \(X_t\) for \(t = 1,2,3,\dots\). Each variable \(X_t\) has the same distribution \(f_{\theta}(x)\), where \(P(X_t \leq x) = \int_{-\infty}^x dy\ f_{\theta}(y)\) when the \(X_t\) are continuous random variables and \(P(X_t = x) = f_{\theta}(x)\) when the \(X_t\) take on discrete values. Here, \(\theta\) denotes the parameters that define the distribution. We will write \(\theta_j\), using an index \(j\in\{0,1\}\), to indicate parameter values under the null (\(j=0\)) and alternative (\(j=1\)) hypotheses, respectively. Also, lowercase \(x_t\) denotes the observed value of \(X_t\), and \(\vec{x}_t = (x_1,x_2,\dots,x_t)\) denotes the vector of observed values of the first \(t\) random variables. For more details on our notation see the first post in this series — any notation not explained below is defined there.
The basic object of study in the SPRT is the likelihood ratio at each time \(t\),
\[\Lambda_t(\vec{x}_t) = \frac{\mathcal{L}_{\theta_1}(\vec{x}_t)}{\mathcal{L}_{\theta_0}(\vec{x}_t)}\ ,\]
where \(\mathcal{L}_{\theta_j}(\vec{x}_t) = \prod_{s=1}^t f_{\theta_j}(x_s)\) is the likelihood function at time \(t\) under hypothesis \(j\). To carry out the test we also need to choose two positive numbers \(A\) and \(B\) satisfying
\[0 < A < 1 < B\ .\]
These numbers determine the type I and type II error rates of the test.
Finally, the actual test proceeds as follows. At each time \(t\), we compare \(\Lambda_t(\vec{x}_t)\) to \(A\) and \(B\). The rules for how the test proceeds are:
- If \(\Lambda_t(\vec{x}_t) \leq A\), stop the test and accept the null hypothesis \(H_0\).
- If \(\Lambda_t(\vec{x}_t) \geq B\), stop the test and accept the alternative hypothesis \(H_1\).
- If \(A < \Lambda_t(\vec{x}_t) < B\), continue testing (i.e., collect the next observation \(x_{t+1}\) and repeat this analysis).
We defined the stopping time \(T\) of the test to be the random variable equal to the value of \(t\) at which the test finishes. In other words, it is the smallest value of \(t\) for which condition 1 or 2 from above is satisfied.
New notation for the proof
For the proof here it will be useful to work with the logarithm of the likelihood ratio as our basic variable. This makes everything additive rather than multiplicative, and is easier to work with. To this end we define new random variables \(Y_t\) at each time \(t\) by
\[Y_t = \ln\left(\frac{f_{\theta_1}(X_t)}{f_{\theta_0}(X_t)}\right)\ ,\]
so \(Y_t\) is the logarithm of the likelihood ratio for the single random variable \(X_t\). If we also define \(Z_t\) to be the cumulative sum of the \(Y_s\) for \(s \leq t\),
\[Z_t = \sum_{s=1}^t Y_s = \ln(\Lambda_t)\ ,\]
and define new constants \(a\) and \(b\) by
\[a = \ln(A), \ b = \ln(B)\ ,\]
then we can reformulate the stopping conditions of the SPRT in terms of \(Z_t\). Specifically, let \(z_t\) be the observed value of \(Z_t\). Then the SPRT can be reformulated as:
- If \(z_t \leq a\), stop the test and accept the null hypothesis \(H_0\).
- If \(z_t \geq b\), stop the test and accept the alternative hypothesis \(H_1\).
- If \(a < z_t < b\), continue testing (i.e., collect the next observation \(x_{t+1}\) and repeat this analysis).
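To make the reformulated rules concrete, here is a minimal Python sketch of the SPRT in the log domain. The function name `sprt_log`, the Bernoulli data stream, and the threshold values are our own illustrative choices and are not part of Wald’s formulation; the decision logic is exactly the three rules above.

```python
import math
import random

def sprt_log(y_stream, a, b):
    """Run the SPRT on a stream of per-observation log likelihood ratios y_t.

    a = ln(A) < 0 < b = ln(B).  Returns (decision, stopping_time)."""
    z = 0.0                                # cumulative log likelihood ratio z_t
    for t, y in enumerate(y_stream, start=1):
        z += y
        if z <= a:
            return "accept H0", t          # rule 1
        if z >= b:
            return "accept H1", t          # rule 2
        # rule 3: a < z_t < b, so keep sampling
    raise RuntimeError("ran out of data before the test stopped")

# Illustrative Bernoulli example: test q0 = 0.5 against q1 = 0.6,
# with data actually generated under the alternative (q = 0.6).
q0, q1 = 0.5, 0.6
A, B = 0.05, 20.0                          # illustrative thresholds, 0 < A < 1 < B
a, b = math.log(A), math.log(B)

def bernoulli_y_stream(n, q_true):
    """Yield Y_t = ln(f_q1(X_t) / f_q0(X_t)) for n Bernoulli(q_true) draws."""
    for _ in range(n):
        x = 1 if random.random() < q_true else 0
        yield x * math.log(q1 / q0) + (1 - x) * math.log((1 - q1) / (1 - q0))

print(sprt_log(bernoulli_y_stream(100_000, q_true=0.6), a, b))
```

Since \(z_t = \ln(\Lambda_t)\), comparing \(z_t\) to \(a\) and \(b\) is equivalent to comparing \(\Lambda_t\) to \(A\) and \(B\); summing logarithms is simply more convenient to analyze (and more stable numerically).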
Before moving on, it will be helpful to see what the \(Y_t\) actually look like in a concrete example. Consider the case where the \(X_t\) are Bernoulli random variables, i.e., they can only take the values 0 or 1. Let \(q_j\), for \(j \in \{0,1\}\), be the probability that \(X_t = 1\) under hypothesis \(j\) (so \(\theta_j = q_j\) in this case). Then we have
\[f_{\theta_j}(X_t) = q_j^{X_t} (1 - q_j)^{1-X_t}\]
and
\[Y_t = \ln\left(\frac{q_1}{q_0}\right)X_t + \ln\left(\frac{1-q_1}{1-q_0}\right)(1-X_t)\ .\]
It will also be helpful to look at the mean and variance of \(Y_t\) in this example. In particular, we will see that these are non-zero if \(q_0 \neq q_1\). Since \(X_t\) is a Bernoulli random variable with \(P(X_t = 1) = q_j\) under hypothesis \(j\), the mean and variance of \(X_t\) are \(E[X_t] = q_j\) and \(\mathbb{V}[X_t] = q_j (1-q_j)\), respectively, and so the mean and variance of \(Y_t\) under hypothesis \(j\) are
\[\begin{align} E[Y_t] &= \left(\ln\left(\frac{q_1}{q_0}\right) - \ln\left(\frac{1-q_1}{1-q_0}\right)\right) q_j + \ln\left(\frac{1-q_1}{1-q_0}\right) \\ \mathbb{V}[Y_t] &= \left(\ln\left(\frac{q_1}{q_0}\right) - \ln\left(\frac{1-q_1}{1-q_0}\right)\right)^2 q_j(1-q_j)\ . \end{align} \]
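As a quick sanity check on these formulas, the sketch below compares the closed-form mean and variance of \(Y_t\) with simple Monte Carlo estimates (the values \(q_0 = 0.5\) and \(q_1 = 0.6\) and the helper names are arbitrary choices for illustration):

```python
import math
import random

q0, q1 = 0.5, 0.6                              # illustrative parameter values
log_ratio_1 = math.log(q1 / q0)                # ln(q1 / q0)
log_ratio_0 = math.log((1 - q1) / (1 - q0))    # ln((1 - q1) / (1 - q0))

def exact_moments(qj):
    """Closed-form E[Y_t] and V[Y_t] under hypothesis j, where P(X_t = 1) = qj."""
    mean = (log_ratio_1 - log_ratio_0) * qj + log_ratio_0
    var = (log_ratio_1 - log_ratio_0) ** 2 * qj * (1 - qj)
    return mean, var

def simulated_moments(qj, n=200_000):
    """Monte Carlo estimate of the same two moments."""
    ys = [log_ratio_1 * x + log_ratio_0 * (1 - x)
          for x in (1 if random.random() < qj else 0 for _ in range(n))]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return mean, var

for j, qj in enumerate((q0, q1)):
    print(f"hypothesis {j}: exact {exact_moments(qj)}, simulated {simulated_moments(qj)}")
```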
We can see in this example that both of these are non-zero whenever \(q_0 \neq q_1\), and that the sign of \(E[Y_t]\) depends on which hypothesis is true: it is negative under \(H_0\) and positive under \(H_1\). We’ll now return to the general case in order to state the main result on the stopping time.
Statement of the result
The stopping time result that we will prove in this post is as follows. Let \(\mu\) and \(\sigma^2\) be the mean and variance, respectively, of the \(Y_t\),
\[E[Y_t] = \mu\ ,\ \mathbb{V}[Y_t] = \sigma^2\ .\]
We will assume that \(\mu\) and \(\sigma^2\) are non-zero. You can easily check that this is the case for many common choices of the distribution function \(f_{\theta}(x)\) (we verified this for the Bernoulli distribution in the last section). Then we have:
If \(\mu\) and \(\sigma\) are non-zero, then \(P(T = \infty) = 0\). (Lemma 1 from Wald\(^{1,2}\))
In other words, if \(\mu\) and \(\sigma\) are non-zero, then the stopping time of the SPRT is guaranteed to be finite. As we discussed in the first post, this fact is a crucial ingredient for proving the bounds that relate \(A\) and \(B\) to the type I and type II error rates of the SPRT (equations 2 and 3 from our first post).
Proof strategy
Our proof will closely follow Wald’s original proof from the paper cited above, but we will fill in all the supporting details that are only mentioned briefly in Wald’s proof. The strategy of the proof is as follows. First we note that, for any positive integers \(r\) and \(\ell\), we have
\[P(T = \infty) \leq P(T > (\ell+1)r)\ .\]
This inequality simply states that the probability that \(T\) is infinite is less than or equal to the probability that the SPRT has not yet stopped after \((\ell+1)r\) time steps.
Next, we’ll show that the event \(T > (\ell+1)r\) implies the occurrence of an entire set of \(\ell+1\) independent events \(E_k\), for \(k\in\{0,1,\dots,\ell\}\). In other words, the event \(T > (\ell+1)r\) implies the event \(\cap_{k=0}^{\ell}E_k\) (the intersection of the events \(E_k\), i.e., \(E_0\) occurs and \(E_1\) occurs and so on). If the occurrence of an event \(E\) implies the occurrence of another event \(E'\), then we have \(E \subseteq E'\), and so \(P(E) \leq P(E')\). This gives us \(P(T > (\ell+1)r) \leq P(\cap_{k=0}^{\ell}E_k)\), and so
\[P(T = \infty) \leq P(\cap_{k=0}^{\ell}E_k)\ .\]
Next, the fact that the \(E_k\) are all independent implies that \(P(\cap_{k=0}^{\ell}E_k) = \prod_{k=0}^{\ell}P(E_k)\), and so
\[P(T = \infty) \leq \prod_{k=0}^{\ell}P(E_k)\ .\]
Finally, we will show that there exists a value \(r^*\) such that, if we choose \(r > r^*\), then for all \(k\) we will have
\[P(E_k) \leq \epsilon\]
for some \(\epsilon < 1\). Crucially, \(r^*\) and \(\epsilon\) will depend only on \(\mu,\sigma,A\), and \(B\), but will not depend on \(\ell\).
At this point we will have shown that
\[P(T = \infty) \leq \epsilon^{\ell+1}\]
for any \(\ell \geq 1\). Then, since \(r^*\) and \(\epsilon\) do not depend on \(\ell\), we can take the \(\ell \to \infty\) limit of this expression to complete the proof that \(P(T=\infty) = 0\). We’ll now fill in the rest of the details of this proof.
The events \(E_k\)
The first step is to define the events \(E_k\) and show that they are independent. To start, note that \(T > (\ell+1)r\) implies that
\[ a < Z_t < b \tag{1}\]
for all \(t \leq (\ell+1)r\). Now consider times \(t\) of the form \(t = (k+1)r\), for \(k\in\{0,1,\dots,\ell\}\), and note that\(^3\)
\[Z_{(k+1)r} - Z_{kr} = \sum_{t=kr+1}^{(k+1)r}Y_t\ .\]
Since Eq. (1) holds for \(t = kr\) and \(t = (k+1)r\), this relation implies that
\[ a - b < \sum_{t=kr+1}^{(k+1)r}Y_t \]
and
\[\sum_{t=kr+1}^{(k+1)r}Y_t < b - a\ .\]
Since \(a < 0\), if we define
\[c = |a| + b\ ,\]
then we find that \(T > (\ell+1)r\) implies that
\[ - c < \sum_{t=kr+1}^{(k+1)r}Y_t < c \tag{2}\]
for all \(k\in\{0,1,\dots,\ell\}\). The inequality in Eq. (2) defines the events \(E_k\). In addition, it is clear that these events are independent of each other since the \(Y_t\) are independent and two different events \(E_k\) and \(E_{k'}\) (with \(k\neq k'\)) do not involve any of the same random variables.
Bounding the probability of \(E_k\)
To bound the probability of \(E_k\), we’ll start by defining new random variables \(\tilde{Z}_k\) by
\[\tilde{Z}_k =\sum_{t=kr+1}^{(k+1)r}Y_t\ .\]
Then we want to bound the probability that \(\tilde{Z}_k\) lies in the open interval \((-c,c)\). For later use we note that, in terms of \(\mu\) and \(\sigma\), the mean and variance of the \(\tilde{Z}_k\) are
\[E[\tilde{Z}_k] = r\mu\ ,\ \mathbb{V}[\tilde{Z}_k] = r\sigma^2\ ,\]
and then \(E[(\tilde{Z}_k)^2] = r^2\mu^2 + r\sigma^2\).
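These moment identities are easy to verify numerically. The sketch below (again with purely illustrative parameter values) draws many independent blocks of \(r\) Bernoulli observations under \(H_1\), forms the block sums \(\tilde{Z}_k\), and compares the empirical moments with \(r\mu\), \(r\sigma^2\), and \(r^2\mu^2 + r\sigma^2\):

```python
import math
import random

q0, q1 = 0.5, 0.6                              # illustrative hypotheses
log_ratio_1 = math.log(q1 / q0)
log_ratio_0 = math.log((1 - q1) / (1 - q0))
mu = (log_ratio_1 - log_ratio_0) * q1 + log_ratio_0            # E[Y_t] under H1
sigma2 = (log_ratio_1 - log_ratio_0) ** 2 * q1 * (1 - q1)      # V[Y_t] under H1

r, n_blocks = 100, 20_000
block_sums = []
for _ in range(n_blocks):                      # each pass builds one block sum Z~_k
    z = 0.0
    for _ in range(r):
        x = 1 if random.random() < q1 else 0
        z += log_ratio_1 * x + log_ratio_0 * (1 - x)
    block_sums.append(z)

m1 = sum(block_sums) / n_blocks                # empirical E[Z~_k]
m2 = sum(z * z for z in block_sums) / n_blocks # empirical E[Z~_k^2]
print("E[Z~_k]   :", m1, "vs", r * mu)
print("V[Z~_k]   :", m2 - m1 ** 2, "vs", r * sigma2)
print("E[Z~_k^2] :", m2, "vs", r ** 2 * mu ** 2 + r * sigma2)
```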
The bound on \(P(E_k)\) is essentially an application of the Paley-Zygmund inequality, but for completeness we will derive it from scratch. For this discussion we’ll assume that \(\mu > 0\); a similar result can be derived for \(\mu < 0\).
To start, let \(\mathcal{C}\) denote the open interval \((-c,c)\), and consider the expectation value of one of the random variables \(\tilde{Z}_k\). We can write
\[E[\tilde{Z}_k] = E\big[\tilde{Z}_k\mathbb{I}[\tilde{Z}_k\in\mathcal{C}]\big] + E\big[\tilde{Z}_k\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\big] \ , \tag{3}\]
where \(\mathbb{I}[\tilde{Z}_k\in\mathcal{C}]\) is the indicator function for the event that \(\tilde{Z}_k\) is in \(\mathcal{C}\) (it equals 1 if \(\tilde{Z}_k\) is in \(\mathcal{C}\) and 0 otherwise) and \(\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\) is the indicator function for the event that \(\tilde{Z}_k\) is not in \(\mathcal{C}\).
Let’s focus on the first term here. When \(\tilde{Z}_k\) is in \(\mathcal{C}\) we clearly have \(\tilde{Z}_k \leq c\), and so
\[\begin{align}E\big[\tilde{Z}_k\mathbb{I}[\tilde{Z}_k\in\mathcal{C}]\big] &\leq c E\big[\mathbb{I}[\tilde{Z}_k\in\mathcal{C}]\big] \\ &= cP(\tilde{Z}_k\in\mathcal{C}) \\ &\leq c\ .\end{align}\]
Moving on to the second term, we can write
\[\begin{align} E\big[\tilde{Z}_k\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\big] &\leq \Big| E\big[\tilde{Z}_k\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\big] \Big| \\ &\leq \sqrt{E\big[(\tilde{Z}_k)^2\big]E\big[\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\big]}\ , \end{align}\]
where, to get from the first to the second line, we used the Cauchy-Schwarz inequality
\[\Big|E[f(\tilde{Z}_k)g(\tilde{Z}_k)]\Big| \leq \sqrt{E[f(\tilde{Z}_k)^2]E[g(\tilde{Z}_k)^2]}\]
with the functions \(f\) and \(g\) given by \(f(\tilde{Z}_k) = \tilde{Z}_k\) and \(g(\tilde{Z}_k) = \mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\). Then we also used the fact that \(\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]^2 = \mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\) (since an indicator function can only equal 0 or 1, the square of an indicator function is equal to itself).
Continuing on, we know that \(E\big[(\tilde{Z}_k)^2\big] = r^2\mu^2 + r\sigma^2\) and \(E\big[\mathbb{I}[\tilde{Z}_k\notin\mathcal{C}]\big] = P(\tilde{Z}_k\notin\mathcal{C})\). If we apply all of these bounds back to Eq. (3), then we find
\[r\mu \leq c + \sqrt{(r^2\mu^2 + r\sigma^2)P(\tilde{Z}_k\notin\mathcal{C})}\ ,\]
or
\[r\mu - c \leq \sqrt{(r^2\mu^2 + r\sigma^2)(1-P(\tilde{Z}_k\in\mathcal{C}))}\ , \tag{4}\]
where we subtracted \(c\) from both sides and used \(P(\tilde{Z}_k\notin\mathcal{C}) = 1 - P(\tilde{Z}_k\in\mathcal{C})\).
Now if \(r\mu - c < 0\), then Eq. (4) is not useful since the left-hand side is negative while the right-hand side is positive. But, if we choose \(r\) large enough so that \(r\mu - c > 0\), then this equation allows us to bound \(P(\tilde{Z}_k\in\mathcal{C})\) from above. Specifically, if we square both sides of Eq. (4) and then rearrange the terms, we find that
\[P(E_k) = P(\tilde{Z}_k\in\mathcal{C}) \leq \epsilon \ , \]
where
\[\epsilon = 1 - \frac{\left(\mu-\frac{c}{r}\right)^2}{\mu^2 + \frac{\sigma^2}{r}}\ .\]
Finally, it is clear that \(0 < \epsilon < 1\): if we choose \(r\) so that \(r\mu - c > 0\), then \(0 < \mu - \frac{c}{r} < \mu\) and so \(\left(\mu-\frac{c}{r}\right)^2 < \mu^2\), while \(\mu^2 + \frac{\sigma^2}{r}\) is clearly larger than \(\mu^2\). Thus, \(r^* = c/\mu\), so that \(r > r^*\) is exactly the condition \(r\mu - c > 0\), and neither \(r^*\) nor \(\epsilon\) depends on the value of the integer \(\ell\) used in the proof, which then allows us to take \(\ell \to \infty\). This completes the proof.
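To get a feel for how quickly the bound \(\epsilon^{\ell+1}\) collapses, here is a short sketch that evaluates \(r^*\), \(\epsilon\), and the resulting bound for the Bernoulli example used earlier (all parameter values are illustrative; any fixed \(r > r^* = c/\mu\) works):

```python
import math

# Bernoulli example from earlier (illustrative values throughout).
q0, q1 = 0.5, 0.6
A, B = 0.05, 20.0
a, b = math.log(A), math.log(B)
c = abs(a) + b                                  # c = |a| + b

log_ratio_1 = math.log(q1 / q0)
log_ratio_0 = math.log((1 - q1) / (1 - q0))
mu = (log_ratio_1 - log_ratio_0) * q1 + log_ratio_0        # E[Y_t] under H1 (> 0 here)
sigma2 = (log_ratio_1 - log_ratio_0) ** 2 * q1 * (1 - q1)  # V[Y_t] under H1

r_star = c / mu                                 # threshold: we need r*mu - c > 0
r = 2 * math.ceil(r_star)                       # any fixed r above the threshold works
eps = 1 - (mu - c / r) ** 2 / (mu ** 2 + sigma2 / r)

print(f"r* = {r_star:.1f}, r = {r}, epsilon = {eps:.4f}")
for ell in (10, 100, 1000):
    print(f"ell = {ell:5d}:  P(T = infinity) <= {eps ** (ell + 1):.3e}")
```

Even though \(\epsilon\) itself is not particularly small, raising it to the power \(\ell+1\) drives the bound on \(P(T=\infty)\) to zero as \(\ell\) grows, which is exactly the limit taken at the end of the proof.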
Footnotes
- Wald, A. (1944). On Cumulative Sums of Random Variables. The Annals of Mathematical Statistics, 15(3), 283–296. http://www.jstor.org/stable/2236250.
- Actually, Wald does not include the requirement that \(\mu \neq 0\) in his statement of this lemma, but for simplicity we will assume this condition here and take advantage of it in our proof.
- For the case of \(k=0\) in the equation below, we can define \(Z_0 = 0\).