
This paper explores real-time, or active, monitoring of safety incidents via cumulative sum (CUSUM) control charts.
It’s an interesting read, and if you’re not savvy with control charts or statistics then just skip over the technical stuff. Some of the findings are still pretty relevant.
In saying that, I’ve skipped most of the dense statistical discussions from the paper.
I’m a big fan of control charts. While I recognise the many limitations of incident data (especially their statistical rarity, high degree of randomness, and sensitivity to reporting definitions), I also think we can put the data to better use, since we normally need to track it anyway.
This paper explores CUSUM control charts based both on incident counts and on time-between-events, i.e. the elapsed time between successive incidents.
They also compare different reporting timeframes for the aggregated incident counts. E.g. they run simulations of incidents and compare the performance of CUSUM control charts with incident counts aggregated quarterly, monthly, semi-monthly, weekly, every three days, and daily.
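To make the aggregation step concrete, here’s a minimal sketch (my own, not from the paper) of rolling a daily incident series up into the kinds of reporting periods they compare:

```python
def aggregate(daily_counts, period_days):
    """Sum a daily incident series into consecutive reporting
    periods, e.g. period_days=7 for weekly, 30 for roughly monthly.
    """
    return [sum(daily_counts[i:i + period_days])
            for i in range(0, len(daily_counts), period_days)]

# Two weeks of made-up daily counts rolled up into weekly totals.
daily = [0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0]
weekly = aggregate(daily, 7)   # → [3, 2]
```

The same stream aggregated over longer windows carries the same totals but reveals them later, which is the trade-off the simulations explore.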
While control charts were originally designed and used primarily in the quality sphere and in manufacturing environments, they’ve since been applied elsewhere, including healthcare. They’ve also been used quite a bit for investigating traffic accidents.
Traditional control charts are said to use a baseline data set to establish the chart parameters, e.g. an average incident or breakdown value from 6–24 months. When the chart value surpasses some established upper limit (2 or 3 standard deviations above the average), this signals that the process is out of control.
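As a sketch of that traditional approach (my own illustration, with made-up numbers): establish a mean and an upper control limit from a baseline window, then flag any later period whose count exceeds the limit:

```python
def control_limits(baseline_counts, n_sigmas=3):
    """Mean and upper control limit from baseline period counts."""
    n = len(baseline_counts)
    mean = sum(baseline_counts) / n
    var = sum((x - mean) ** 2 for x in baseline_counts) / (n - 1)
    return mean, mean + n_sigmas * var ** 0.5

def out_of_control(new_counts, ucl):
    """Indices of monitoring periods whose count exceeds the limit."""
    return [i for i, x in enumerate(new_counts) if x > ucl]

# 12 months of baseline incident counts (illustrative values only).
baseline = [4, 6, 5, 3, 7, 5, 4, 6, 5, 4, 6, 5]
mean, ucl = control_limits(baseline)
signals = out_of_control([5, 6, 12, 4], ucl)   # the 12 stands out
```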
The Poisson CUSUM chart incorporates more data and is said to be more sensitive to small shifts in underlying parameters, and better able to detect changes.
They argue that “The values of counted data, such as the number of nonconforming parts or the number of accidents, often follow a Poisson distribution (Lucas, 1985). Therefore, a Poisson CUSUM chart is appropriate for monitoring counted data”.
[** Some research has supported the argument that incident data seems to follow Poisson or negative binomial distributions; Tahira Probst published a recent study which found the same.]
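The upper-sided Poisson CUSUM itself is simple to compute: each period’s count is compared against a reference value k, shortfalls are clipped at zero, and a signal is raised once the accumulated excess crosses a decision interval h. A minimal sketch (my own; the reference-value formula is the standard one for detecting a Poisson shift from mean mu0 to mu1):

```python
from math import log

def poisson_cusum(counts, mu0, mu1, h):
    """Upper one-sided Poisson CUSUM over per-period counts.

    mu0: in-control mean count; mu1: out-of-control mean to detect;
    h: decision interval. Returns the CUSUM path and the first
    period index that signals (or None if none does).
    """
    # Standard reference value for a Poisson shift from mu0 to mu1.
    k = (mu1 - mu0) / (log(mu1) - log(mu0))
    c, path, signal = 0.0, [], None
    for i, x in enumerate(counts):
        c = max(0.0, c + x - k)
        path.append(c)
        if signal is None and c >= h:
            signal = i
    return path, signal
```

Because each period adds its evidence to a running statistic rather than being judged alone, the chart picks up sustained small increases that a single-period limit would miss.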
An alternative to aggregated incident counts (e.g. 9 MTIs over the month) is to chart the time between events. This monitors the amount of time that has elapsed between the current event and the immediately prior event.
An advantage of this approach is that “rather than waiting until the end of a fixed time period for aggregated count data, the information is incorporated as it is obtained during the process”.
Moreover, “When monitoring the occurrence of rare adverse events (such as nonconforming parts, surgical errors, or industrial accidents) an increase in the time between events is desirable and indicates an improvement in the process”.
[** Again, I’m a little sceptical and cautious about over-analysing incident data and what it’s supposed to indicate, but again I think it can be used more valuably.]
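A common way to monitor time-between-events is a CUSUM on the inter-event gaps, where unusually short gaps accumulate evidence that the event rate has risen. A rough sketch (my own, using the standard exponential-CUSUM reference value; not necessarily the exact chart used in the paper):

```python
from math import log

def tbe_cusum(gaps_days, lam0, lam1, h):
    """CUSUM on times between events (gaps, in days).

    lam0: acceptable event rate (events/day); lam1 > lam0: the
    degraded rate to detect. Shrinking gaps push the statistic up;
    a signal means events are arriving faster than the baseline.
    """
    k = log(lam1 / lam0) / (lam1 - lam0)  # reference gap length
    s, path, signal = 0.0, [], None
    for i, t in enumerate(gaps_days):
        s = max(0.0, s + k - t)
        path.append(s)
        if signal is None and s >= h:
            signal = i
    return path, signal

# Gaps shrink from ~25-30 days to a few days: the chart signals.
path, sig = tbe_cusum([25, 30, 5, 3, 2, 1], 0.05, 0.1, 20.0)
```

The key property matches the quote above: each new event updates the chart immediately, with no waiting for a reporting period to close.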
They briefly discuss some applications of binomial samples vs Bernoulli data, and how some healthcare research has used Poisson and binomial aggregate data for tracking birth defects. Interestingly, in that example the use of Poisson and binomial aggregation was found to “reduce the efficiency of control charts”.
Results
The figure below highlights some applications of control charts using the CUSUM. I’ll skip the more specific applications that they show in other figures.

Based on these simulations, they observe that (quoting the paper below):
· quarterly aggregation produces a signal after the first quarter (month 3, after the 23rd accident)
· monthly after month 3 (23 incidents)
· semi-monthly during the middle of month 3 (11 weeks, 19 accidents)
· weekly after 10 weeks (16 accidents)
· every 3 days after 10 weeks (16 incidents)
· daily during the 10th week (the 15th accident)
With real-time data, or close to it, as found in the daily counts, the CUSUM statistic should be reset following the resolution of the issue/hazard control. They didn’t show the resetting step here.
They say it’s important to carefully choose the length of the monitoring period when using aggregated Poisson count data. Longer monitoring periods, as are often used (e.g. 1, 3 or 6 months) “lead to a higher degree of aggregation and a subsequent loss of information”.
You’ll likely find a similar phenomenon with your own internal rolling average transformation (which smooths out fluctuations).
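For instance, a trailing three-month rolling mean flattens a one-month spike considerably (a quick illustration of mine, not from the paper):

```python
def rolling_mean(xs, window):
    """Trailing rolling mean; defined once a full window exists."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

counts = [2, 3, 2, 9, 2, 3]          # one spiked month among six
smoothed = rolling_mean(counts, 3)   # the spike of 9 is damped
```

The raw series peaks at 9, but no smoothed value exceeds 5: the very fluctuation you might want to investigate gets averaged away.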
The trade-off, they argue, is that waiting longer for more aggregated data risks slower detection of emerging trends.
Their findings indicate that “charts using data from shorter monitoring periods detect increases in accident frequency more quickly”. And in their example, “daily Poisson CUSUM chart signals an increase in the number of safety incidents sooner than any of the charts with longer monitoring periods”.
Another trade-off here is that while shorter reporting periods allow closer-to-real-time monitoring, they also come with a time cost: somebody has to aggregate the data and update the charts. Of course, this is something that can readily be automated via software workflows.
They provide a bunch more info around time between incidents if multiple events happen on the same day, which I’ve skipped. Nevertheless, aggregating by day is likely the shortest reasonable level of aggregation for most companies.
They state that based on their simulated results, “better performance [results from] shorter periods of data aggregation and real-time time-between-accidents”.
They point out several limitations of these aggregated methods, too. Again I’ve skipped this.
They say that “In manufacturing systems it is assumed that the process would be stopped following a control chart signal, the machine recalibrated or other action taken, the process resumed, and the CUSUM monitoring statistic reset to zero”.
While this works for manufacturing systems, they recognise it doesn’t work that cleanly with social systems/work.
Work cannot necessarily be entirely ‘stopped’ to resolve the issue as it can be in manufacturing. In their modelling examples, they assumed a turn-around of 30 days to resolve the hazard before resetting the control chart statistic.
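In code terms, their reset scheme might look something like the sketch below (my own illustration): after a signal, the chart is paused for a fixed resolution period and the statistic is then reset to zero:

```python
def cusum_with_reset(counts, k, h, resolution_days=30):
    """Daily CUSUM that, after signalling, waits out a fixed
    hazard-resolution period and then resets the statistic to zero.

    Returns the list of days on which the chart signalled.
    """
    c, signals, hold = 0.0, [], 0
    for day, x in enumerate(counts):
        if hold > 0:          # hazard being resolved; chart paused
            hold -= 1
            continue
        c = max(0.0, c + x - k)
        if c >= h:
            signals.append(day)
            c, hold = 0.0, resolution_days
    return signals
```

With a shorter `resolution_days`, monitoring resumes sooner after each signal, which is exactly the 10-day vs 30-day comparison discussed next.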
They also model other resolution timeframes: a 10-day hazard-control implementation period is compared with the 30 days above, leading to quicker follow-up signals and improvement opportunities.
Hence, as expected, shorter reporting and resolution periods lead to more rapid feedback and monitoring, but potentially demand significantly more time and resources to manage.
Their findings suggest that the collection and analysis of incident data at the highest practicable frequency provides the greatest opportunity for effective mitigation of hazards.
They briefly discuss expanding the sample beyond simply incidents/injuries to also near hits. This logic can be further expanded with other proxies connected to incidents, critical control failures, Go/No Go decisions and other items of interest.
Ref: Anna Schuh, Jaime A. Camelio & William H. Woodall (2014) Control charts for accident frequency: a motivation for real-time occupational safety monitoring, International Journal of Injury Control and Safety Promotion, 21:2, 154-162.

Study link: https://doi.org/10.1080/17457300.2013.792285