Wald’s Identity and the Expected Stopping Time
This is the third post in our ongoing series on the Sequential Probability Ratio Test (SPRT). In the first post we introduced the idea of sequential hypothesis testing and then presented the SPRT itself. In the second post we began exploring the mathematical details of the SPRT, and we showed how to prove that the stopping time \(T\) for the SPRT is finite under very general conditions.
This post will explore another aspect of the math behind the SPRT, namely Wald’s identity for the expected stopping time. This identity is the main tool that one would use for calculating the expected stopping time \(E[T]\) of the SPRT. In particular, it can be used to show that \(E[T]\) can be much smaller than the fixed sample size required for a Likelihood Ratio Test with the same type I and type II error rates as the SPRT. Wald’s identity can also be used to bound the true type I and type II error rates in the approximate form of the SPRT that we introduced in the first post.
In this post we will introduce Wald’s identity and then demonstrate a few applications of it. At the end of the post we will show how to prove this identity, following Wald’s original proof. For details on the notation used here, we recommend reviewing the first and second posts in this series (see links above).
In the fourth and final post of this series, we will look at concrete examples of the SPRT and use the results from this post to understand what is going on in those examples.
Statement of Wald’s Identity
To state Wald’s identity, we first need to introduce \(\varphi(h)\), the moment generating function of the \(Y_t\) variables (introduced in our second post),
\[\varphi(h) = E[e^{h Y_t}]\ ,\]
which is a function of the variable \(h\). Recall that we are assuming that the \(Y_t\) all have the same distribution, so it does not matter which value of \(t\) is used in this equation.
Suppose that \(h\) is such that \(\varphi(h)\) exists and is finite, and that \(|\varphi(h)| \geq 1\). Then Wald’s identity is the statement that, at the stopping time \(T\), we have
\[E\left[e^{h Z_T}\left(\varphi(h)\right)^{-T}\right] = 1\ . \tag{1}\]
Sometimes, when people use the phrase Wald’s identity (or Wald’s equation), they are referring to a less general form of this identity that applies specifically to the expectation of \(Z_T\) itself, namely the identity
\[E[Z_T] = E[T]\mu\ , \tag{2}\]
where \(\mu = E[Y_t]\) is the mean of the \(Y_t\).
This version of Wald’s identity can be easily derived from Eq. (1) above by differentiating that identity with respect to \(h\) and then evaluating the result at \(h=0\). Indeed, taking the derivative of Eq. (1) gives
\[E\left[Z_T e^{h Z_T}\left(\varphi(h)\right)^{-T} - T\varphi'(h) e^{h Z_T}\left(\varphi(h)\right)^{-T-1}\right] = 0\ ,\]
where \(\varphi'(h)\) denotes the first derivative of \(\varphi(h)\) with respect to \(h\). If we now set \(h=0\) and use \(\varphi(0) = 1\) and \(\varphi'(0) = E[Y_t] = \mu\), then we recover Eq. (2).
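Wald’s equation Eq. (2) is also easy to verify by simulation. Here is a minimal Python sketch that does so, using an illustrative Bernoulli SPRT under the null hypothesis (the parameter values and thresholds below are our own choices, anticipating the Bernoulli example discussed later in this post):

```python
import math
import random

# Illustrative parameters (our own choices): a Bernoulli SPRT under the
# null hypothesis q = q0, with Wald's approximate thresholds for target
# error rates alpha = beta = 0.05.
q0, q1 = 0.5, 0.6
ln_A, ln_B = math.log(0.05 / 0.95), math.log(0.95 / 0.05)

# The two possible values of Y_t, and their mean mu = E[Y_t] under the null.
y_up = math.log(q1 / q0)              # increment when X_t = 1
y_dn = math.log((1 - q1) / (1 - q0))  # increment when X_t = 0
mu = q0 * y_up + (1 - q0) * y_dn      # negative under the null

random.seed(0)

def run_sprt():
    """Run one SPRT path; return (Z_T, T) at the stopping time."""
    z, t = 0.0, 0
    while ln_A < z < ln_B:
        z += y_up if random.random() < q0 else y_dn
        t += 1
    return z, t

n = 20000
paths = [run_sprt() for _ in range(n)]
mean_Z = sum(z for z, _ in paths) / n
mean_T = sum(t for _, t in paths) / n

# Wald's equation, Eq. (2), predicts E[Z_T] = E[T] * mu; the two estimates
# below should agree up to Monte Carlo noise.
print(mean_Z, mean_T * mu)
```

The two printed numbers estimate the same quantity via Eq. (2), so they should agree up to sampling noise.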
Bounding the type I and type II error rates
Although it may not be obvious from their appearance, Eqs. (1) and (2) are powerful tools for understanding the behavior of the SPRT. In this section we’ll demonstrate this by showing how Eq. (1) can be used to derive upper and lower bounds on the type I error rate \(\alpha\) of the SPRT. A similar argument can be used to derive bounds on the type II error rate \(\beta\). These bounds are useful for understanding the true values of these error rates in the approximate form of the SPRT that we introduced in the first post in this series.
To start, we’ll choose a specific value \(h_0\) of \(h\), namely a non-zero solution to the equation
\[\varphi(h) = 1\ .\]
In Lemma 2 of On Cumulative Sums of Random Variables1, Wald showed that this equation has a unique non-zero solution \(h_0\) for a large family of probability distributions that satisfy a few very mild assumptions. For simplicity, we will assume that we are working with one of these distributions, and we’ll look at a specific example below. We will also assume that \(h_0\) is positive.
With this choice of \(h = h_0\), Wald’s identity Eq. (1) takes the simple form
\[E\left[e^{h_0 Z_T}\right] = 1\ .\]
From here we note that, by definition of the stopping time \(T\), we have \(Z_T \leq \ln(A)\) or \(Z_T \geq \ln(B)\), and so we can condition the expectation value here on these two possibilities:
\[\begin{align}E\left[e^{h_0 Z_T}\right] &= E\left[e^{h_0 Z_T}|Z_T\leq\ln(A)\right]P(Z_T\leq\ln(A)) \\ &+ E\left[e^{h_0 Z_T}|Z_T\geq\ln(B)\right]P(Z_T\geq\ln(B))\ .\end{align}\]
In addition, under the null hypothesis we have \(\alpha = P(Z_T\geq\ln(B)) = 1 - P(Z_T\leq\ln(A))\), while under the alternative hypothesis we have \(\beta = P(Z_T\leq\ln(A)) = 1 - P(Z_T\geq\ln(B))\). It will also be useful to define quantities \(E_A\) and \(E_B\) by
\[\begin{align} E_B &= E\left[e^{h_0 Z_T}|Z_T\geq\ln(B)\right] \\ E_A &= E\left[e^{h_0 Z_T}|Z_T\leq\ln(A)\right]\ .\end{align}\]
Putting all this back into the simplified form of Wald’s identity from above, and assuming that the null hypothesis holds (since we want to bound \(\alpha\)), we find
\[E_A(1-\alpha) + E_B\alpha = 1\ ,\]
or
\[\alpha = \frac{1-E_A}{E_B-E_A}\ . \tag{3}\]
If we could derive upper and lower bounds on \(E_A\) and \(E_B\), then we would immediately get bounds on \(\alpha\). That is what we will do next.
To start, we will state some general bounds that one can derive for \(E_A\) and \(E_B\). The derivation of these can be found in Wald’s paper cited in Footnote 1 below. Then we will derive these in the special case of the Bernoulli distribution that we considered in our last post. The discrete nature of the Bernoulli distribution makes the bounds easier to derive.
For the case of \(E_B\), we can motivate these bounds by writing \(Z_T = Z_{T-1} + Y_T\) and then defining a random variable \(R = \ln(B) - Z_{T-1} > 0\), which measures how far \(Z_{T-1}\) is from the value \(\ln(B)\) required for the SPRT to stop at \(T\) with \(Z_T\geq\ln(B)\). In terms of \(R\) we have \(Z_T = \ln(B) + Y_T - R\), and the condition \(Z_T\geq\ln(B)\) can be rewritten as \(Y_T\geq R\). With this setup in mind, Wald’s general upper bound on \(E_B\) takes the form
\[E_B\leq B^{h_0}\eta\ , \]
where
\[\eta = \max_{r>0}E[e^{h_0(Y – r)}|Y-r\geq 0]\ ,\]
and \(Y\) is a random variable with the same distribution as the \(Y_t\). Since \(Y_T-R = Z_T - \ln(B)\), we can think of \(\eta\) as the maximum possible expected value of \(e^{h_0(Z_T - \ln(B))}\), where \(Z_T - \ln(B) \geq 0\) is the overshoot of the SPRT at the stopping time, i.e., how much larger \(Z_T\) is compared to the value \(\ln(B)\) that it must reach for the SPRT to stop at the upper boundary.
If we put this upper bound together with the fact that \(E_B \geq e^{h_0\ln(B)} = B^{h_0}\), then we find these bounds on \(E_B\),
\[B^{h_0} \leq E_B \leq B^{h_0}\eta\ .\]
Using similar logic, one can also show that \(E_A\) obeys the bounds
\[A^{h_0}\eta' \leq E_A \leq A^{h_0}\ ,\]
where
\[\eta' = \min_{r>0}E[e^{h_0(Y + r)}|Y+r\leq 0]\ .\]
To motivate this result like we did above, we would instead define the random variable \(R\) as \(R = Z_{T-1} - \ln(A) > 0\), so that \(Z_T = \ln(A) + Y_T + R\).
Finally, using these bounds in Eq. (3) gives us final upper and lower bounds on \(\alpha\),
\[\frac{1-A^{h_0}}{B^{h_0}\eta-A^{h_0}\eta'} \leq \alpha \leq \frac{1-A^{h_0}\eta'}{B^{h_0}-A^{h_0}}\ .\]
As we mentioned above, this bound can be used to understand the behavior of the approximate form of the SPRT that we introduced in the first post in this series. We will apply it to specific examples in our next post.
Next, we’ll rederive this general bound in the special case of the Bernoulli example from our last two posts. Recall that in this example the original random variables \(X_t\) equal \(1\) with probability \(q\) and \(0\) with probability \(1-q\), where \(q=q_0\) under the null hypothesis and \(q=q_1\) under the alternative hypothesis. In terms of the \(X_t\), the random variables \(Y_t\) take the form (see our last post for more details)
\[Y_t = \ln\left(\frac{q_1}{q_0}\right)X_t + \ln\left(\frac{1-q_1}{1-q_0}\right)(1-X_t)\ .\]
Let us assume that \(q_1 > q_0\). Then with probability \(q\) we have
\[Y_t = \ln\left(\frac{q_1}{q_0}\right) > 0\ ,\]
and with probability \(1-q\) we have
\[Y_t = \ln\left(\frac{1-q_1}{1-q_0}\right) < 0\ ,\]
which leads to
\[\varphi(h) = q\left(\frac{q_1}{q_0}\right)^h + (1-q)\left(\frac{1-q_1}{1-q_0}\right)^h\ .\]
One can check that, for \(q_1>q_0\) and under the null hypothesis (\(q=q_0\)), the non-zero solution \(h_0\) to \(\varphi(h) = 1\) has \(h_0>0\). We will use this fact in what follows.
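To make this concrete, here is a short Python sketch (with illustrative parameter values of our own choosing) that finds \(h_0\) for this \(\varphi(h)\) by bisection. As a consistency check, note that under the null hypothesis \(\varphi(1) = q_0\cdot\frac{q_1}{q_0} + (1-q_0)\cdot\frac{1-q_1}{1-q_0} = q_1 + (1-q_1) = 1\), so \(h_0 = 1\) in this example, and the bisection should recover exactly this value:

```python
import math

# Illustrative Bernoulli parameters, evaluated under the null (q = q0).
q0, q1 = 0.5, 0.6

def phi(h):
    """Moment generating function of Y_t for the Bernoulli example."""
    return q0 * (q1 / q0) ** h + (1 - q0) * ((1 - q1) / (1 - q0)) ** h

# phi is convex with phi(0) = 1 and phi'(0) = mu < 0 under the null, so
# the unique non-zero root of phi(h) = 1 is positive; bracket and bisect.
lo, hi = 0.5, 2.0
assert phi(lo) < 1 < phi(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 1:
        lo = mid
    else:
        hi = mid
h0 = 0.5 * (lo + hi)
print(round(h0, 6))  # -> 1.0
```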
In this example we of course still have \(E_A \leq A^{h_0}\) and \(B^{h_0}\leq E_B\). To bound \(E_B\) from above, we write \(Z_T = Z_{T-1} + Y_T\) and use the facts that \(Z_{T-1}<\ln(B)\) and \(Y_T\leq\ln\left(\frac{q_1}{q_0}\right)\). This leads to
\[E_B \leq e^{h_0\ln(B)+h_0\ln\left(\frac{q_1}{q_0}\right)} = B^{h_0}\left(\frac{q_1}{q_0}\right)^{h_0} \ ,\]
which takes the form of the general bound from above with
\[\eta = \left(\frac{q_1}{q_0}\right)^{h_0}\ ,\]
and in this case one can check by a direct calculation that this agrees with \(\eta\) as defined above.
Using the same ideas, we can also obtain the following lower bound for \(E_A\)
\[A^{h_0} \left(\frac{1-q_1}{1-q_0}\right)^{h_0} \leq E_A \ ,\]
which also takes the form of the general bound from above with
\[\eta' = \left(\frac{1-q_1}{1-q_0}\right)^{h_0}\ .\]
One can again check that this agrees with \(\eta'\) as defined in the general bound above.
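To see what these bounds look like in practice, here is a small Python sketch that evaluates them for illustrative parameter values (our own choices: \(q_0=0.5\), \(q_1=0.6\), and Wald’s approximate thresholds for target error rates of \(0.05\)):

```python
# Illustrative setup (our own choices): Bernoulli SPRT with Wald's
# approximate thresholds A = beta/(1 - alpha), B = (1 - beta)/alpha
# for target error rates alpha = beta = 0.05.
q0, q1 = 0.5, 0.6
A = 0.05 / 0.95
B = 0.95 / 0.05
h0 = 1.0  # the non-zero root of phi(h) = 1 under the null in this example

# Overshoot factors derived in the text for the Bernoulli case.
eta = (q1 / q0) ** h0
eta_p = ((1 - q1) / (1 - q0)) ** h0

# Lower and upper bounds on the true type I error rate alpha.
lower = (1 - A ** h0) / (B ** h0 * eta - A ** h0 * eta_p)
upper = (1 - A ** h0 * eta_p) / (B ** h0 - A ** h0)
print(lower, upper)
```

For these numbers the bounds pin the true \(\alpha\) between roughly \(0.042\) and \(0.051\), nicely consistent with the target value of \(0.05\).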
Bounding the expected stopping time
The same ideas from the previous section can be applied to the second form of Wald’s identity — Eq. (2) above — to obtain upper and lower bounds on \(E[T]\) (the expectation value of the stopping time). In this case we first use Eq. (2) to write \(E[T] = E[Z_T]/\mu\), and then condition \(E[Z_T]\) on \(Z_T \leq \ln(A)\) and \(Z_T \geq \ln(B)\) as before,
\[\begin{align} E[T] &= \frac{1}{\mu} E\left[Z_T|Z_T\leq\ln(A)\right]P(Z_T\leq\ln(A)) \\ &+ \frac{1}{\mu}E\left[Z_T|Z_T\geq\ln(B)\right]P(Z_T\geq\ln(B)) \ . \end{align}\]
Focusing on the null hypothesis, we again have \(\alpha = P(Z_T\geq\ln(B)) = 1 – P(Z_T\leq\ln(A))\).
In this case it is also possible to derive general bounds on the conditional expectation values appearing in the equation above.2 They can again be motivated by defining the same random variables \(R\) from the previous section. In this case one finds that
\[ \ln(A) + \xi' \leq E\left[Z_T|Z_T\leq\ln(A)\right] \leq \ln(A)\ ,\]
and
\[\ln(B) \leq E\left[Z_T|Z_T\geq\ln(B)\right] \leq \ln(B) + \xi\ ,\]
where \(\xi\) and \(\xi'\) are given by
\[\begin{align}\xi &= \max_{r>0}E[Y - r|Y - r\geq 0] \nonumber \\ \xi' &= \min_{r>0}E[Y + r|Y + r\leq 0]\ .\nonumber \end{align}\]
Combining these results leads to a bound on \(E[T]\) of the form
\[\frac{\alpha\xi}{\mu} \leq E[T] - \frac{1}{\mu}\left((1-\alpha)\ln(A) + \alpha\ln(B)\right) \leq \frac{(1-\alpha)\xi'}{\mu}\ . \tag{4}\]
(Note that \(\mu < 0\) under the null hypothesis, so dividing by \(\mu\) reverses the direction of the inequalities; with \(\xi \geq 0\) and \(\xi' \leq 0\), the left-hand offset is negative and the right-hand offset is positive, as they should be.)
This result holds when the null hypothesis is true and shows that \(E[T]\) is close to the value
\[\frac{1}{\mu}\left((1-\alpha)\ln(A) + \alpha\ln(B)\right)\ .\]
It is possible to derive a similar result when the alternative hypothesis is true.
Among other uses, the inequality in Eq. (4) can be used to show that \(E[T]\) for an SPRT with error rates \(\alpha\) and \(\beta\) can be much smaller than the fixed sample size required for a Likelihood Ratio Test with the same error rates (recall from our first post that this is one of the key advantages of the SPRT). We will look at a specific example in the next post.
Finally, we can again specialize to the Bernoulli example, and in that case one can show via a direct calculation, or by use of the general formulas for \(\xi\) and \(\xi'\) from above, that Eq. (4) holds with
\[\begin{align}\xi &= \ln\left(\frac{q_1}{q_0}\right) \\ \xi' &= \ln\left(\frac{1-q_1}{1-q_0}\right) \ . \end{align}\]
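As a numerical sanity check of Eq. (4), the following Python sketch (again with illustrative parameter values of our own choosing) simulates the Bernoulli SPRT under the null hypothesis and confirms that the empirical mean stopping time lies between the two endpoints of the bound:

```python
import math
import random

# Illustrative parameters (our own choices), matching the Bernoulli
# example: simulate under the null hypothesis q = q0.
q0, q1 = 0.5, 0.6
ln_A, ln_B = math.log(0.05 / 0.95), math.log(0.95 / 0.05)

y_up = math.log(q1 / q0)
y_dn = math.log((1 - q1) / (1 - q0))
mu = q0 * y_up + (1 - q0) * y_dn  # negative under the null
xi, xi_p = y_up, y_dn             # xi and xi' for the Bernoulli case

random.seed(1)

def stopping_time():
    """Simulate one SPRT path under the null; return (T, hit_upper)."""
    z, t = 0.0, 0
    while ln_A < z < ln_B:
        z += y_up if random.random() < q0 else y_dn
        t += 1
    return t, z >= ln_B

n = 20000
runs = [stopping_time() for _ in range(n)]
mean_T = sum(t for t, _ in runs) / n
alpha = sum(hit for _, hit in runs) / n  # empirical type I error rate

# Center term of Eq. (4) and the two endpoints of the bound; since
# mu < 0, order the endpoints explicitly before comparing.
center = ((1 - alpha) * ln_A + alpha * ln_B) / mu
end1 = center + alpha * xi / mu
end2 = center + (1 - alpha) * xi_p / mu
low, high = min(end1, end2), max(end1, end2)
print(low, mean_T, high)
```

The interval here is only a few steps wide, so Eq. (4) really does pin down \(E[T]\) quite tightly in this example.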
Proof of Wald’s identity
To close this post we will now present a proof of Wald’s identity, closely following Wald’s original proof from the paper cited in Footnote 1 below. The key to the proof is to not focus on \(Z_T\) directly, but to instead focus on \(Z_{t_0}\) for a fixed time \(t_0\). At the end of the proof we will take the limit \(t_0 \to \infty\) to obtain the final result.
Starting with \(Z_{t_0}\), which is just a sum of \(t_0\) independent variables \(Y_t\), we immediately know that
\[E\left[e^{h Z_{t_0}}\right] = \left(\varphi(h)\right)^{t_0}\ .\]
On the other hand, we can use the law of total probability, and the fact that \(P(T = \infty) = 0\) (for a proof of that fact, see the second post in this series) to write
\[E\left[e^{h Z_{t_0}}\right] = \sum_{t=1}^{\infty}E\left[e^{h Z_{t_0}}|T = t\right]P(T=t)\ .\]
We will now separate this out into two terms: one for \(t \leq t_0\), and another for \(t > t_0\):
\[E\left[e^{h Z_{t_0}}\right] = \text{Term 1} + \text{Term 2}\ ,\]
where
\[\begin{align} \text{Term 1} &= \sum_{t=1}^{t_0}E\left[e^{h Z_{t_0}}|T = t\right]P(T=t) \\ \text{Term 2} &= \sum_{t=t_0 + 1}^{\infty}E\left[e^{h Z_{t_0}}|T = t\right]P(T=t)\ .\end{align}\]
To evaluate Term 1 we first rewrite \(Z_{t_0}\) in the form \(Z_{t_0} = Z_t + (Z_{t_0}-Z_t)\). Since we are conditioning on \(T=t\) with \(t\leq t_0\), in this term \(Z_t\) is independent of \(Z_{t_0}-Z_t\), and \(Z_{t_0}-Z_t\) itself is just the sum of \(t_0 - t\) independent variables. This leads to
\[\begin{align} \text{Term 1} &= \sum_{t=1}^{t_0}E\left[e^{h Z_t}e^{h (Z_{t_0}-Z_t)}|T = t\right]P(T=t) \\ &= \sum_{t=1}^{t_0}(\varphi(h))^{t_0 – t}E\left[e^{h Z_t}|T = t\right]P(T=t)\ .\end{align}\]
The next step is to derive a bound on Term 2. In this case, since we are conditioning on \(T=t\) with \(t > t_0\), we know that \(\ln(A) < Z_{t_0} < \ln(B)\). This means that, within Term 2, we have the bound
\[\left|E\left[e^{h Z_{t_0}}|T = t\right]\right| < \max(A^h,B^h)\ ,\]
and so
\[\begin{align}\left|\text{Term 2}\right| &< \max(A^h,B^h)\sum_{t=t_0 + 1}^{\infty}P(T=t) \\ &= \max(A^h,B^h)P(T > t_0)\ .\end{align}\]
At this point we can take the equation \(E\left[e^{h Z_{t_0}}\right] = \text{Term 1} + \text{Term 2}\) and divide both sides by \((\varphi(h))^{t_0}\) to find
\[1 = \sum_{t=1}^{t_0}(\varphi(h))^{-t}E\left[e^{h Z_t}|T = t\right]P(T=t) + \frac{\text{Term 2}}{(\varphi(h))^{t_0}}\ . \]
Subtracting the sum coming from Term 1 from both sides, and then applying the inequality from above, gives
\[\begin{align} \left| 1 – \sum_{t=1}^{t_0}(\varphi(h))^{-t}E\left[e^{h Z_t}|T = t\right]P(T=t) \right| &= \left|\frac{\text{Term 2}}{(\varphi(h))^{t_0}}\right| \\ &< \frac{\max(A^h,B^h)}{|\varphi(h)|^{t_0}}P(T > t_0)\ .\end{align}\]
Finally, we can take the limit \(t_0\to\infty\) to derive the final result. In this limit the right-hand side of this inequality goes to zero for the following reasons. First, we know (see the second post in this series) that \(\lim_{t_0\to\infty}P(T > t_0) = 0\). Next, \(A\), \(B\), and \(h\) are constants that are independent of \(t_0\), so \(\max(A^h,B^h)\) does not change as \(t_0\) grows. Finally, as noted above, we chose \(h\) so that \(|\varphi(h)| \geq 1\). Combining all of these facts together shows that the right-hand side of the inequality above goes to zero as \(t_0 \to \infty\).
On the left-hand side of our inequality, we can see that in the \(t_0\to\infty\) limit we have
\[\begin{align} \lim_{t_0\to\infty}\sum_{t=1}^{t_0}(\varphi(h))^{-t}E\left[e^{h Z_t}|T = t\right]P(T=t) &= \sum_{t=1}^{\infty}E\left[e^{h Z_t}(\varphi(h))^{-t}|T = t\right]P(T=t) \\ &= E\left[e^{h Z_T}(\varphi(h))^{-T}\right]\ , \end{align} \]
and then, since the right-hand side of the inequality goes to zero, we find that this quantity is equal to 1,
\[E\left[e^{h Z_T}(\varphi(h))^{-T}\right] = 1\ .\]
This completes the proof of Wald’s identity.
Footnotes
1. Wald, A. (1944). On Cumulative Sums of Random Variables. The Annals of Mathematical Statistics, 15(3), 283–296. http://www.jstor.org/stable/2236250.
2. Wald, A. (1945). Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics, 16(2), 117–186. http://www.jstor.org/stable/2235829.