Online change point detection and CUSUM

Note: the code used to generate the figures in this post can be found in the “change-point-detection” repository on my GitHub page.

In this post we’ll start to look at change point detection, which is the problem of detecting a sudden change in a parameter that characterizes some ongoing process. There are actually two kinds of change point detection: online and offline. In online change point detection one attempts to detect a change as the process is running, using only the data collected up to the current time. In contrast, offline change point detection takes place after the process has completed and all the data is available. In this case one uses the data from the entire process to try to determine where sudden changes might have occurred. For many real-world problems the online version is more appropriate, especially if it is important to take action as soon as possible after a change occurs. For this reason, we will focus on online change point detection here.

An example that we will return to throughout our discussion is the following practical problem. Suppose you own a factory and you buy a machine that makes a certain product. When the machine is running normally the manufacturer of the machine guarantees that 5% or less of products will be defective.1 As the machine runs, you would like to know if something breaks and it starts producing defective products at a rate higher than 5%. But maybe you only have time to conduct regular maintenance on the machine after every 250 products. Is there a way that you can monitor the output of the machine that will help you determine if the machine malfunctions before the next scheduled maintenance period? The answer is yes, and this is where online change point detection comes into play.

The main idea of online change point detection for this kind of problem2 is to keep track of the number of defective products produced so far, and to use that data in some way to decide whether the current defective product rate is still 5% or has increased. The process will be monitored continuously, and so we will re-evaluate all the data after each new product is produced. If our chosen method indicates that a change might have taken place, we will say that the method “raises an alarm”.

The specific method that we will look at is called CUSUM (short for “Cumulative Sum”), and it was originally proposed by E.S. Page for exactly this kind of manufacturing problem.3 It has all kinds of interesting applications, including to the problem of monitoring for adverse outcomes in medical contexts.4

We’ll start by formulating our example problem in more detail. Then we’ll explain what the CUSUM method is and how to quantify type I and type II errors when using this method. This will require us to introduce the concept of the Average Run Length, which is the average amount of time before the CUSUM method raises an alarm. Finally, we’ll briefly introduce the Markov chain approach to CUSUM, which is one of the main methods used to actually perform the calculations needed when applying this method. We’ll cover the Markov chain approach to CUSUM in more detail in the next blog post.

Setup and notation

Let’s start by introducing some notation that will allow for a precise formulation of our example problem. Let \(i \in\{1,2,3,\dots\}\) enumerate the products produced by the machine as it starts running, i.e., the first product is labelled by \(i=1\), the second by \(i=2\), and so on. For each product \(i\) we define a random variable \(U_i\) that equals 0 if the product is satisfactory and 1 if the product is defective (so the \(U_i\) are Bernoulli random variables).

We’ll use \(q_0\) to represent the probability that a product is defective when the machine is working normally. When describing our example above we assumed that this probability was 5%, i.e., \(q_0 = 0.05\), but in what follows we will allow for a general value of \(q_0\). If we also use the symbol \(P_0(\cdots)\) to denote the probability of the event in parentheses when the machine is working normally, then we have (for all \(i\))

\[P_0(U_i = 1) = q_0\ .\]

We will also assume that the \(U_i\) are independent of each other when the machine is working normally, and so the \(U_i\) are i.i.d. random variables in that case.

In the CUSUM method the process is continuously monitored for changes. As we will see below, when a new product \(i\) is produced, the values of the \(U_j\) for all \(j\leq i\) are used to determine whether the machine is still running normally or if the rate of production of defective products has increased.  

The CUSUM method

CUSUM works by constructing a running numerical score that monitors the quality of the output of the machine. The score is designed so that each new satisfactory product lowers the score by a small amount, while each new defective product raises the score by a larger amount. Often the score is designed so that it will average out to zero in the long run when everything is working normally.

With the score designed in this way, a long string of consecutive satisfactory products will tend to bring the score down to a new minimum value. For example, imagine that the score starts at zero and several satisfactory products are then produced in a row. The score then has a negative value that is lower than its initial value of zero. On the other hand, multiple defective products will tend to raise the score to new heights.

CUSUM capitalizes on this behavior. It raises an alarm (i.e., it signals a possible change in the process) when the score rises above its previous minimum value by a large enough amount, which is usually denoted by \(h\). It turns out that, even when the process is running normally, this condition will eventually be fulfilled if the process goes on long enough. Therefore, the decision on whether to take action (e.g., by stopping production and inspecting the machine) when the method raises an alarm will depend on how long it takes for the score to reach a level high enough to raise the alarm. We will discuss this point in more detail in the next section on type I and type II error control in CUSUM.

There are many ways to define a score for CUSUM, but here we will use a score suggested in the original work of Page (cited above). To use this form of the score we will assume that the probability \(q_0\) of a defective product during normal operation can be expressed as a simple rational number with numerator 1 and denominator \(n\) (assumed to be a positive integer),

\[q_0 = \frac{1}{n}\ .\]

In our original example with \(q_0 = 0.05\) we would then have \(n = 20\). 

Next, for each product \(i\) we define its contribution \(X_i\) to the overall score as

\[X_i = n U_i - 1\ .\]

With this choice, a satisfactory product (\(U_i =0\)) gives a contribution of \(X_i = -1\), and a defective product (\(U_i =1\)) gives a contribution of \(X_i = n-1\) (with \(n=20\), this reproduces the score used in Page’s paper). In addition, when the process is running normally the average contribution of a product to the score is zero,

\[E_0[X_i] = n \left(\frac{1}{n}\right) - 1 = 0\ ,\]

where \(E_0[\cdots]\) denotes the expectation value of the quantity in square brackets when the process is running normally.

The running score \(S_i\) for CUSUM is defined as follows. First, we start the score at zero by defining \(S_0 = 0\). Next, for each \(i\geq 1\) we define

\[S_i = S_{i-1} + X_i = \sum_{j=1}^i X_j\ .\]

The score \(S_i\) is the cumulative sum of the contributions from all of the products produced so far, and this is where the method gets its name.

Finally, the condition for CUSUM to sound an alarm after product \(i\) is produced is given by

\[S_i - \min_{0 \leq j < i}(S_j) \geq h\ . \tag{1}\]

In the figure below we show an example of the behavior of the score \(S_i\) for the first 100 products when the process is running normally in the example with \(q_0 = 0.05\) (or \(n = 20\)). 
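If you would like to experiment with this yourself, here is a minimal Python sketch of the score and the alarm condition in Eq. (1). The function and variable names are my own, not taken from the repository linked above:

```python
import random

def simulate_cusum(q, n, h, max_steps=100_000, rng=None):
    """Simulate the CUSUM score S_i = sum_j (n*U_j - 1) and return the
    index of the first product at which S_i - min_{j<i} S_j >= h,
    or None if no alarm is raised within max_steps products."""
    rng = rng or random.Random()
    s = 0            # current score S_i
    running_min = 0  # minimum of S_j over j < i (recall S_0 = 0)
    for i in range(1, max_steps + 1):
        u = 1 if rng.random() < q else 0  # U_i: 1 means defective
        s += n * u - 1                    # contribution X_i = n*U_i - 1
        if s - running_min >= h:
            return i                      # alarm raised after product i
        running_min = min(running_min, s)
    return None

# Example: normal operation with q0 = 0.05 (so n = 20) and h = 63
alarm_time = simulate_cusum(q=0.05, n=20, h=63, rng=random.Random(0))
```

Note that the running minimum is updated only after the alarm check, so the comparison uses \(\min_{j<i} S_j\) exactly as in Eq. (1).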

Type I and type II error control in CUSUM

Type I and type II error control in CUSUM is a bit different from that in ordinary statistical tests, although we can still clearly formulate a null hypothesis and various alternative hypotheses (depending on the specific problem at hand). In particular, normal operation of the process is our null hypothesis. However, as we mentioned above, CUSUM will eventually raise an alarm even when the process is running normally, if the process is allowed to go on long enough. Therefore, the probability of rejecting the null hypothesis when it is true is always equal to 1, and there is no obvious way to guarantee that this probability is less than a certain amount (usually denoted by \(\alpha\)) like in the usual approach to type I error control in statistical tests.5

Instead, the standard approach to type I and type II error control with CUSUM is to quantify the average time it will take for CUSUM to raise an alarm under the null hypothesis and any alternative hypotheses. This average time is known as the Average Run Length (ARL). When the null hypothesis is true the ARL tells us how long, on average, it will take before CUSUM raises a false alarm (it is a false alarm because the null hypothesis is true and no actual change has taken place). This is how type I errors are quantified in CUSUM. Similarly, under a particular alternative hypothesis the ARL tells us how long (again, on average) it will take CUSUM to correctly detect that the process has changed. This is analogous to the concept of statistical power, and it is how type II errors are quantified in CUSUM. 

If we fix the form of the score contributions \(X_i\), then the parameter \(h\) (which determines how long CUSUM will go before raising an alarm) is the main parameter we have available to tune the ARL. Typically, one chooses \(h\) so that the ARL is sufficiently large under the null hypothesis that false alarms end up taking a long time to materialize. We’ll discuss this more in our example below. 

In the example we have been discussing, the null hypothesis amounts to the assumption that the \(U_i\) are i.i.d. random variables whose probability distribution is completely specified by the condition \(P_0(U_i=1) = q_0\) (and we can think of the “0” subscript as indicating values under the null hypothesis). Various alternative hypotheses are possible. One example is to assume that the \(U_i\) are still i.i.d. random variables but the probability that a product is defective is given by \(P_1(U_i=1) = q_1\) for some value \(q_1\) that is larger than \(q_0\) (and we now use the subscript “1” to indicate values under this alternative hypothesis).

Here is one concrete way that we can use information about our specific problem to choose \(h\). Let’s go back to our example from the introduction, where we had \(q_0 = 0.05\) and we assumed that the machine undergoes regular maintenance after every 250 products produced. In this case we could choose \(h\) so that the ARL under the null hypothesis is 250.6 This choice guarantees that, on average, a false alarm will not occur until the regularly scheduled maintenance time of the machine. This means that, on average, we will not waste any time checking the machine prematurely when it is functioning normally.

For our definition of the score \(X_i\), with \(q_0 = 0.05\) and so \(n = 20\), choosing \(h = 63\) yields an ARL of about 255 under the null hypothesis, which is good enough for our purposes. For an alternative hypothesis with \(q_1 = 0.1=2q_0\), and using CUSUM with the same value of \(h\), the ARL is about 58, which is much less than the value under the null hypothesis.7 This gives us a good chance of detecting such a change long before the regularly scheduled maintenance period.
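These ARL values can be checked approximately by direct simulation. Below is a minimal Monte Carlo sketch (the function names are my own); the estimates will fluctuate around the exact values from run to run:

```python
import random

def run_length(q, n, h, rng):
    """Number of products until the CUSUM condition S_i - min_{j<i} S_j >= h."""
    s, running_min = 0, 0
    i = 0
    while True:
        i += 1
        u = 1 if rng.random() < q else 0
        s += n * u - 1
        if s - running_min >= h:
            return i
        running_min = min(running_min, s)

def estimate_arl(q, n, h, trials=2000, seed=0):
    """Monte Carlo estimate of the ARL: the average run length over many trials."""
    rng = random.Random(seed)
    return sum(run_length(q, n, h, rng) for _ in range(trials)) / trials

arl_null = estimate_arl(q=0.05, n=20, h=63)  # quoted in the text as about 255
arl_alt = estimate_arl(q=0.10, n=20, h=63)   # quoted in the text as about 58
```

As expected, the estimated ARL under the alternative hypothesis comes out much smaller than under the null hypothesis.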

CUSUM as a Markov chain — an introduction

At this point you may be wondering how you can actually analyze CUSUM and calculate what value of \(h\) will yield the desired ARL under the null hypothesis. Indeed, in the last section we simply stated that picking \(h=63\) gives an ARL of about 255 under our null hypothesis, but did not explain how we arrived at that number. The key to doing this type of calculation is to formulate CUSUM as a Markov chain and, more precisely, as a Markov chain with an absorbing state.8 We’ll cover this in detail in the next post, but we’ll close this post with a quick preview of how it works.

The key to this formulation is to define a new score \(S'_i\) as follows. First, we again define \(S'_0 = 0\). Then, for any \(i > 0\) we define

\[S'_i = \begin{cases} h\ , & S'_{i-1} + X_i \geq h \\  S'_{i-1} + X_i\ , & 0 \leq S'_{i-1} + X_i < h \\ 0\ , & S'_{i-1} + X_i < 0\ . \end{cases} \]

The new score \(S'_i\) has two nice properties. First, it always takes values between 0 and \(h\) (inclusive). Second, it resets itself to 0 any time the original score \(S_i\) reaches a new minimum. To convince yourself that this second property holds, it is helpful to look at an example of both scores shown on the same plot, as we show in the figure below for the case of \(q_0 = 0.2\) and \(n = 5\) (which is easier to inspect visually than the case with \(n = 20\)). For this figure we chose \(h = 8\) and we only show the process up until the time that CUSUM raised an alarm.
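In code, the bounded score amounts to clamping the running sum to the interval \([0, h]\). The sketch below (the names are my own) computes both scores from the same sequence of contributions and checks that they raise the alarm at the same time:

```python
import random

def reflected_score_path(xs, h):
    """Compute the bounded score S'_i from contributions xs, stopping at h."""
    s_prime, path = 0, [0]
    for x in xs:
        s_prime = min(max(s_prime + x, 0), h)  # clamp to [0, h]
        path.append(s_prime)
        if s_prime == h:
            break
    return path

def first_alarm_original(xs, h):
    """First i with S_i - min_{j<i} S_j >= h, using the original score."""
    s, running_min = 0, 0
    for i, x in enumerate(xs, start=1):
        s += x
        if s - running_min >= h:
            return i
        running_min = min(running_min, s)
    return None

# Check the equivalence on a random Bernoulli sequence (q0 = 0.2, n = 5, h = 8)
rng = random.Random(1)
xs = [5 * (1 if rng.random() < 0.2 else 0) - 1 for _ in range(10_000)]
i_orig = first_alarm_original(xs, h=8)
path = reflected_score_path(xs, h=8)
i_reflected = len(path) - 1 if path[-1] == 8 else None
assert i_orig == i_reflected
```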

The properties of \(S'_i\) allow us to rewrite Eq. (1) for when CUSUM raises an alarm in the simple form

\[S'_i = h \ ,\]

i.e., we raise an alarm the first time \(S'_i\) reaches the value \(h\).

We can now immediately reformulate this problem in terms of a Markov chain by defining the transition matrix \(p_{s,t}\) via

\[ P(S'_{i+1} = t|S'_i = s) = p_{s,t}\ ,\]

where \(s,t\in\{0,1,2,\dots,h\}\). Then \(p_{s,t}\) is the probability that the score will transition from the value \(s\) to the value \(t\) after product \(i+1\) is produced. The transition matrix \(p_{s,t}\) will not depend on the index \(i\) if we assume (as we have been doing) that, under the null or alternative hypothesis, the probability of a product being defective does not vary from product to product (i.e., the \(U_i\) are i.i.d.). 

Since CUSUM stops the first time that \(S'_i = h\), the value \(h\) is an absorbing state for the Markov chain, and so we will have \(p_{h,h} = 1\) and \(p_{h,t} = 0\) for any \(t\neq h\). In other words, in the Markov chain formulation, it is impossible to transition out of the state \(h\) once it has been reached.
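For our Bernoulli example, each row of the transition matrix has at most two nonzero entries: from a state \(s < h\) the score can only move to \(\max(s-1, 0)\) (satisfactory product) or \(\min(s+n-1, h)\) (defective product). Here is a minimal sketch of the construction (the function name is my own):

```python
import numpy as np

def cusum_transition_matrix(n, q, h):
    """Transition matrix p[s, t] for the bounded CUSUM score S'_i with
    contributions X_i = n*U_i - 1, for states s, t in {0, 1, ..., h}."""
    p = np.zeros((h + 1, h + 1))
    for s in range(h):                   # non-absorbing states s < h
        p[s, max(s - 1, 0)] += 1 - q     # satisfactory product: X_i = -1
        p[s, min(s + n - 1, h)] += q     # defective product: X_i = n - 1
    p[h, h] = 1.0                        # h is an absorbing state
    return p

# Transition matrix for the figure example above: q0 = 0.2, n = 5, h = 8
P = cusum_transition_matrix(n=5, q=0.2, h=8)
```

The `+=` updates guard against the (edge) case where both moves land on the same state, so that every row still sums to 1.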

Within this formulation, the ARL is equal to the average time it takes for the Markov chain to reach the state \(h\) starting from the state \(0\), also known as the expected hitting time for the state \(h\) starting from \(0\). We’ll dive into the calculation of this quantity in more detail in the next post.

Footnotes

  1. In real life 5% is probably unacceptably large, but it’s a convenient value for our example.
  2. A modern example of a similar problem is the following. Suppose you have an app with a video chatting feature, and you recently made changes to the app. You could use the same approach to monitor the number of calls that have technical issues (e.g., the call disconnects) so you can tell if the changes to the app have increased the probability that a call will have a technical issue.
  3. E. S. Page, “Continuous Inspection Schemes.” Biometrika, Volume 41, Issue 1-2, June 1954, Pages 100–115, https://doi.org/10.1093/biomet/41.1-2.100
  4. Grigg O.A., Farewell V.T., Spiegelhalter D.J., “Use of risk-adjusted CUSUM and RSPRT charts for monitoring in medical contexts.” Statistical Methods in Medical Research. 2003;12(2):147-170. https://doi.org/10.1177/096228020301200205
  5. There are some non-obvious ways that use additional information about the specific process being monitored. We may discuss these in a separate blog post.
  6. You could also make the ARL under the null hypothesis larger, say 1.5 × 250, to incorporate an extra buffer that will make false alarms even less likely to occur before the regularly scheduled maintenance.
  7. It is important to note that, even if the alternative hypothesis is true, the form of the score contributions \(X_i = nU_i -1\) for CUSUM remains the same as under the null hypothesis (in particular, \(U_i\) is still multiplied by \(n = 1/q_0\)). This is because the score is designed to have zero expectation value under the null hypothesis, but a positive expectation value under any alternative hypothesis with \(q_1 > q_0\).
  8. D. Brook, D. A. Evans, “An approach to the probability distribution of cusum run length.” Biometrika, Volume 59, Issue 3, December 1972, Pages 539–549, https://doi.org/10.1093/biomet/59.3.539
