The Markov chain approach to CUSUM

Note: the code used to do the calculations in this post can be found here in the “change-point-detection” repository on my GitHub page.

In the last post we introduced the problem of online change point detection and the CUSUM method for solving that problem. In this post we’ll dive deeper into the math you need to implement CUSUM. There are several different ways of modeling the CUSUM process, and we won’t cover them all here. Instead, we’ll focus on the most powerful and flexible approach, which is the approach based on Markov chains.¹ As this post builds on the content and examples in the previous post, we recommend that readers start with the previous post before reading this one.

In the last post we used a particular example to guide our discussion of change point detection, and we’ll continue to focus on that example here. Briefly, we considered the problem of a factory that wants to monitor the condition of a machine that produces a certain product. The manufacturer of the machine guarantees that, when the machine is functioning normally, 5% or less of the products it produces will be defective. The factory is able to do regular maintenance on the machine (and check its condition) after every 250 products, but they want some way to monitor the condition of the machine between maintenance periods. This way, if they detect that the machine might have malfunctioned, they can stop it early and inspect it before they waste resources producing a large number of defective products.²

CUSUM provided us with a statistical approach to this problem with guaranteed type I and type II error properties, both measured in terms of the Average Run Length (ARL). Under the null hypothesis (i.e., when the machine is running normally), the ARL tells us how long on average it will take for CUSUM to give a false alarm (type I error). On the other hand, when the machine has malfunctioned, the ARL tells us how long on average it will take for CUSUM to detect that the machine is not functioning normally. If the ARL in the latter case is smaller than the remaining time until the next maintenance period, then the malfunction will be detected early, thereby preventing the manufacture of a large number of defective products.

In the last post we introduced CUSUM and explained what the ARL was, but we did not explain how to actually calculate the ARL. In this post we’ll explain how this can be done by formulating CUSUM in terms of a Markov chain with an absorbing state.

Review of setup and notation

We’ll start by reviewing the setup from our previous post. Let \(i \in\{1,2,3,\dots\}\) enumerate the products produced by the machine as it starts running, with the first product labelled by \(i=1\), the second by \(i=2\), etc. For each product \(i\) we define a random variable \(U_i\) that equals 0 if the product is satisfactory and 1 if the product is defective.

For the null and alternative hypotheses that we will explore in this post, we will assume that the \(U_i\) are i.i.d. random variables. We use \(q_0\) to represent the probability that a product is defective when the machine is working normally (null hypothesis), and \(q_1\) to represent the probability that a product is defective when the machine is malfunctioning (alternative hypothesis). We assume that \(q_0 < q_1\) as there is a higher probability of producing a defective product when the machine is malfunctioning. We also use the notation \(P_0(\cdots)\) and \(P_1(\cdots)\) to denote probabilities under the null and alternative hypothesis, respectively. Then we have \(P_0(U_i = 1) = q_0\) and \(P_1(U_i = 1) = q_1\) for all \(i\).

For the particular score function that we used for the CUSUM approach to our example problem, we assumed that \(q_0\) took the form

\[q_0 = \frac{1}{n}\ \]

for some integer \(n >1\). In our original example with \(q_0 = 0.05\) (5% change of a defective product) we would then have \(n = 20\).

To define the score for CUSUM we first defined the contribution \(X_i\) of each product \(i\) to the overall score as

\[X_i = n U_i – 1\ .\]

Our total score \(S_i\), which is the running cumulative sum that gives CUSUM its name, was then defined by \(S_0 = 0\) and

\[S_i = S_{i-1} + X_i = \sum_{j=1}^i X_j\ .\]

Finally, CUSUM sounds an alarm (i.e., signals a possible change) at the first value of \(i\) at which \(S_i\) rises a height \(h\) above its previous minimum value,

\[S_i – \min_{j<i}(S_j) \geq h\ . \]

The height \(h\) is a control parameter for CUSUM, and it is up to the user to determine an appropriate value of \(h\) for the problem at hand.

Formulating CUSUM as a Markov chain

At the very end of the last post, we started to explain how CUSUM can be formulated as a Markov chain. The key idea was to first define a modified score \(S’_i\) via the following formulas. First, we again define \(S’_0 = 0\). Then, for any \(i > 0\) we define

\[S’_i = \begin{cases} h\ , & S’_{i-1} + X_i > h \\ S’_{i-1} + X_i\ , & 0 \leq S’_{i-1} + X_i \leq h \\ 0\ , & S’_{i-1} + X_i < 0\ . \end{cases} \]

The modified score \(S’_i\) always takes values between 0 and \(h\) (inclusive). In addition, it has the useful property that it resets itself to 0 any time the original score \(S_i\) reaches a new minimum. This property implies that, when written in terms of \(S’_i\), the condition for CUSUM to sound an alarm takes the extremely simple form

\[ S’_i = h\ . \]

In other words, CUSUM sounds an alarm as soon as the modified score \(S’_i\) reaches the value \(h\) for the first time.

As is typical in statistics, we’ll try to detect a change point by comparing what is actually happening with the machine to what we would expect to happen if the null hypothesis held, i.e., if the machine continued to operate normally for all time. Then, to understand how long it will take CUSUM to detect a change after one has occurred, we’ll look at what we would expect to happen if the machine has already malfunctioned, i.e., if the alternative hypothesis holds.

What this all means is that, when we do our analysis, we will assume that either the null hypothesis or the alternative hypothesis holds for all time. Mathematically, this means that we will assume that the \(U_i\) are i.i.d. and that for all \(i\) we either have \(P(U_i = 1) = q_0\) or \(P(U_i = 1) = q_1\) for the null or alternative hypothesis, respectively.

To formulate CUSUM as a Markov chain, the next thing we need to do is identify the probability that the modified score will transition from a given value \(s\) to a new value \(t\) after the next product \(i+1\) is created. Since we are assuming that either the null or alternative hypothesis holds for all time, and that the \(U_i\) are i.i.d., this probability is independent of \(i\) and we can denote it by \(p_{s,t}\). It is defined by

\[p_{s,t} = P(S’_{i+1} = t|S’_i = s)\ ,\]

which is the conditional probability that \(S’_{i+1} = t\) given that \(S’_i = s\). As we mentioned above, \(s,t\) denote possible values of the modified score \(S’_i\), and so they take values between \(0\) and \(h\), inclusive. For brevity we have also omitted the “0” or “1” subscripts indicating probabilities calculated under the null or alternative hypothesis. Of course, the transition matrix will be different in the two cases.

Since CUSUM stops the first time that \(S’_i = h\), this Markov chain has an absorbing state at the value \(h\). In terms of the transition matrix \(p_{s,t}\), this means that \(p_{h,h} = 1\) and \(p_{h,t} = 0\) for all \(t\neq h\) (recall also that \(\sum_{t = 0}^h p_{s,t} = 1\) for all \(s\) since this is a Markov chain). In other words, once the chain gets to the value \(h\), it can never leave again — the probability of transitioning out of that state is zero. This is what we mean by saying that \(h\) is an absorbing state for the Markov chain.

For our problem the transition matrix \(p_{s,t}\) takes a fairly simple form. We already know what it looks like in the row with \(s=h\). To express \(p_{s,t}\) for \(s < h\), it is useful to define two variables \(t_{+}\) and \(t_{-}\) via

\[\begin{align}t_{+} &= \min(s + n – 1, h) \\ t_{-} &= \max(s – 1, 0)\ .\end{align}\]

These are the two possible values the modified score can take on when it starts from \(s\) and a new product is produced. To understand why, recall that a satisfactory product (\(U_i = 0\)) decreases the score by 1 (\(X_i = -1\)), which would normally send \(s\) to \(s-1\). However, if it happened that \(s-1\) was less than zero, then the rules for calculating \(S’_{i+1}\) from \(S’_i\) would yield \(S’_{i+1} = 0\) for the modified score. This explains the value of \(t_{-}\). In a similar way, a defective product (\(U_i = 1\)) increases the score by \(n-1\) (\(X_i = n-1\)), which would normally send \(s\) to \(s+n-1\). However, if it happened that \(s+n-1 > h\), then the rules for calculating \(S’_{i+1}\) from \(S’_i\) would yield \(S’_{i+1} = h\) for the modified score. This explains the value of \(t_{+}\).

Let \(q\) be the probability that a product is defective, so we have \(q=q_0\) or \(q = q_1\) for the null or alternative hypothesis, respectively. Then, given \(t_{+}\) and \(t_{-}\) that we introduced above, the transition matrix for \(s<h\) can be written in the simple form

\[p_{s,t} = (1-q)\delta_{t,t_{-}} + q\delta_{t,t_{+}}\ ,\]

where \(\delta_{s,t}\) is the Kronecker delta function (the function that equals 1 when \(s=t\) and 0 otherwise). The only difference in \(p_{s,t}\) under the null hypothesis vs. the alternative is the value of \(q\).

Suppose now that the modified score is currently at the value \(s\). Let the random variable \(T_s\) be equal to the time it takes³ for the modified score to reach the value \(h\) starting from \(s\). The expectation value of \(T_s\), denoted by \(t_s\) (i.e., \(t_s = E[T_s]\)) is known as the expected hitting time for the state \(h\) starting from \(s\). The ARL that we are interested in for CUSUM is exactly equal to \(t_0\),

\[\text{ARL} = t_0 = E[T_0]\ ,\]

and this is simply because in CUSUM we start the modified score from the initial value \(S’_0 = 0\). Next, we’ll review some facts about Markov chains that will help us calculate these expected hitting times.

Some facts about Markov chains with an absorbing state

To calculate the expected hitting times \(t_s\), we will need to know a few basic results about Markov chains with an absorbing state.⁴ Let \(P\) be the Markov transition matrix with matrix elements \(p_{s,t}\). It is a \((h+1)\times(h+1)\) matrix since \(s\) and \(t\) can take on the \(h+1\) values \(0,1,2,\dots,h\). Since \(p_{h,h} = 1\) and \(p_{h,t} = 0\) for all \(t\neq h\), it is useful to focus on the \(h\times h\) matrix \(R\) that is obtained by removing the last row and column from \(P\). In fact, \(P\) can be written in terms of \(R\) as

\[P = \begin{pmatrix} R & \mathbf{v} \\ \mathbf{0}^T & 1 \end{pmatrix}\ ,\]

where the other quantities appearing here are as follows. First, \(\mathbf{0}\) is the \(h\times 1\) column vector consisting of all zeros, and \(\mathbf{0}^T\) is its transpose (the \(1\times h\) row vector consisting of all zeros). Next, \(\mathbf{v}\) is an \(h\times 1\) column vector that is related to \(R\) via

\[\mathbf{v} = (\mathbb{I} – R)\cdot\mathbf{1}\ ,\]

where \(\mathbb{I}\) is the \(h\times h\) identity matrix and \(\mathbf{1}\) is the \(h\times 1\) column vector consisting of all ones (and the dot denotes multiplication of the vector by the matrix). This relation between \(R\) and \(\mathbf{v}\) follows immediately from the fact that \(\sum_{t=0}^h p_{s,t} = 1\) for all \(s\) because \(P\) is a Markov transition matrix (the sum of the probabilities for transitions out of any state \(s\) must equal one). Finally, the 1 in the bottom right corner of \(P\) is just the final entry \(p_{h,h} =1\).

The matrix \(R\) plays a crucial role in determining the properties of the hitting times \(T_s\). We always have \(T_h = 0\) since if you start in the state \(h\) then you are stuck there. But for all other values of \(s\neq h\) the expected hitting time \(t_s = E[T_s]\) is known to satisfy the recurrence relation

\[t_s = 1 + \sum_{s’=0}^h p_{s,s’} t_{s’}\ . \tag{1}\]

This equation is actually very intuitive because it says that \(t_s\) is equal to 1 plus a weighted sum of the expected hitting times \(t_{s’}\) for all states \(s’\) that can be reached from \(s\) in one time step. In addition, the weighting factor for \(t_{s’}\) is equal to \(p_{s,s’}\), which is just the probability of transitioning from \(s\) to \(s’\) in one time step.

Since \(t_h = 0\), the sum over \(s’\) in Eq. (1) can be restricted to the range \(0,1,2,\dots,h-1\), and in that range we have \(p_{s,s’} = r_{s,s’}\), where \(r_{s,s’}\) are the matrix elements of the matrix \(R\) that we introduced above (recall that we are assuming that \(s\neq h\)). Then Eq. (1) can be rewritten in the form,

\[t_s = 1 + \sum_{s’=0}^{h-1} r_{s,s’} t_{s’}\ ,\]

and from this form we can see that the solution is given by

\[t_s = \big[(\mathbb{I} – R)^{-1}\cdot\mathbf{1}\big]_s\ , \tag{2}\]

i.e., \(t_s\) is equal to the \(s\)th element of the vector obtained by multiplying the vector \(\mathbf{1}\) by the inverse of the matrix \(\mathbb{I} – R\). This formula reduces the calculation of the expected hitting times \(t_s\) to a linear algebra problem that can be easily solved with a computer.

The above formula Eq. (2) for \(t_s\) is all we need to calculate the ARLs for CUSUM. However, it is worth mentioning that the Markov chain approach can also give us more detailed information about the behavior of the hitting times \(T_s\). For example, the probability that \(T_s\) is less than or equal to a certain integer value \(\tau\) is given by

\[P(T_s \leq \tau) = \big[(\mathbb{I} – R^{\tau})\cdot\mathbf{1}\big]_s\ \tag{3} ,\]

which is equal to the \(s\)th element of the vector obtained by multiplying the vector \(\mathbf{1}\) by the matrix \(\mathbb{I} – R^{\tau}\).

This result in Eq. (3) can be derived with a bit of matrix algebra if one starts with a basic fact about Markov chains with an absorbing state: the probability of reaching \(h\) starting from \(s\) in \(\tau\) or fewer steps is given by a simple matrix element of \(P^{\tau}\),⁵

\[P(T_s \leq \tau) = \big[P^{\tau}\big]_{s,h}\ .\]

You will also need to use the following formula for \(P^{\tau}\), which you should be able to convince yourself of by trying the examples of \(\tau=2,3,\dots\),

\[P^{\tau} = \begin{pmatrix} R^{\tau} & (\mathbb{I} + R +\cdots+R^{\tau-1})\cdot\mathbf{v} \\ \mathbf{0}^T & 1 \end{pmatrix}\ .\]

This result in Eq. (3) is very useful if one would like to apply more sophisticated approaches to type I and type II error control in CUSUM. For example, one can consider approaches that go beyond the ARL and instead place a guaranteed bound on the probability that CUSUM will raise an alarm before a certain time under the null hypothesis.

Calculating the Average Run Length (ARL)

We are now ready to complete our calculation of the ARL for our example problem. In fact, what we actually want to do is determine a value for \(h\) that will yield the desired ARL for our problem under the null hypothesis. Recall from our discussion in the previous post that we wanted to choose \(h\) so that, under the null hypothesis, the ARL is close to 250, which is when the machine undergoes its regularly scheduled maintenance. This choice meant that, when the machine is functioning normally, CUSUM will (on average) sound an alarm only around the regularly scheduled maintenance time. This will prevent the factory from wasting resources inspecting the machine early when CUSUM gives a false alarm.

Using our results from the previous section, we are now in a position to calculate the ARL for any value of \(h\). If we consider the null hypothesis with \(q_0 = 0.05\) (or \(n=20\)) for our example problem, then the figure below shows the value of the ARL for a range of different choices of \(h\). In particular, choosing \(h = 63\) gives an ARL of about 255, which is close enough to 250 to be a suitable choice for our example problem.⁶ The dashed red line in the figure indicates our target value of 250 for the ARL under the null hypothesis. It is also interesting to see the cusps in the figure at \(h = 19 = n – 1\) and \(h = 38 = 2(n-1)\), where \(19 = n-1\) is the size of the upward steps for this CUSUM process.

Now that we have an appropriate value of \(h\), we can also calculate what the ARL would be under the alternative hypothesis. We’ll focus on a specific alternative hypothesis with \(q_1 = 0.1\), i.e., a doubling of the probability that a product is defective relative to the null hypothesis. Recall though that even under the alternative hypothesis, the score contributions \(X_i\) are still defined using the parameter \(n\) that is related to \(q_0\) from the null hypothesis (\(q_0 = 1/n\)).

For our analysis of the alternative hypothesis, it is interesting to calculate not only the ARL, which is equal to \(t_0\), but also the expected hitting times \(t_s\) for all other values of \(s\). The reason for this is, in a real scenario, the modified score for CUSUM could already be at a non-zero value \(s\) when a change occurs, and so it is of interest to know how long it would take CUSUM to detect a change in a case like this. The plot below shows the values of \(t_s\) for all \(s\) between 0 and \(h\) under our alternative hypothesis and with \(h = 63\).

We can see from the figure that the expected hitting times \(t_s\) decrease fairly rapidly from a maximum value of about 58.5 for \(t_0\) (which is also the ARL under the alternative hypothesis). This is much less than 250, which means that we have a good chance of detecting a malfunction early if the malfunction occurs long before the next scheduled maintenance period.

Footnotes

D. Brook, D. A. Evans, “An approach to the probability distribution of cusum run length.” Biometrika, Volume 59, Issue 3, December 1972, Pages 539–549, https://doi.org/10.1093/biomet/59.3.539
We assume that, when the machine malfunctions, it has a higher probability of producing a defective product than it does when it is functioning normally.
Here, by “time it takes” we really mean the number of products produced before the modified score reaches the value \(h\).
A great set of notes on Markov chains can be found here: https://www.statslab.cam.ac.uk/~rrw1/markov/M.pdf
The reason this expression gives the probability that \(T_s\) is less than or equal to \(\tau\), and not the probability that \(T_s\) is equal to \(\tau\), is as follows. Since \(h\) is an absorbing state, if the Markov chain is known to be in the state \(h\) at time \(\tau\), it could have arrived in that state any time before \(\tau\), since it must stay in the state \(h\) once it reaches that state.
Choosing \(h=62\) gives an ARL of about 248. But it is actually nice to have the ARL be a bit larger than 250 to give some extra protection against false positives.