
Is our use of reported injury measures, like TRIFR or LTIFR, a ‘good enough’ representation of safety performance, or are these measures beset with foundational statistical flaws?
Today’s report is from Hallowell et al., 2020, titled ‘The Statistical Invalidity of TRIR as a Measure of Safety Performance’, from the CSRA.
Make sure to subscribe to Safe AF on Spotify/Apple, and if you find it useful then please help share the news, and leave a rating and review on your podcast app. I also have a Safe AF LinkedIn group if you want to stay up to date on releases.
Join my Safe AF LinkedIn group: https://www.linkedin.com/groups/14717868/

Shout me a coffee (one-off or monthly recurring)
Transcription:
Your safety dashboard looks great, nearly everything is green or amber, with barely a hint of red.
And then you kill somebody.
Are our injury measures more a curse with a poor statistical basis, or a decent enough predictor of harm?
G’day everyone, I’m Ben Hutchinson and this is Safe As,
a podcast dedicated to the thrifty analysis of safety, risk and performance research.
Visit safetyinsights.org for more research.
This well-known report from Hallowell et al.,
produced as part of the CSRA, studied the statistical basis of TRIR, the Total Recordable Incident Rate.
It’s titled The Statistical Invalidity of TRIR as a Measure of Safety Performance.
So what is the TRIR, or TRIFR?
So TRIR, the Total Recordable Incident Rate,
is the rate at which a company experiences an OSHA recordable incident,
scaled per 200,000 worker hours.
In other places around the world, it’s scaled against 1 million worker hours and probably other
variations.
So this helps normalize the value to account for different working hours and,
by extension, different head counts.
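If you want to see that arithmetic on a page rather than just hear it, here’s a minimal sketch. The function name and the worker-hour figures are just illustrative, not from the report.

```python
# Minimal sketch of the TRIR/TRIFR arithmetic. The 200,000-hour base is the
# US OSHA convention; 1,000,000 hours is common elsewhere. Numbers are made up.

def injury_frequency_rate(recordables: int, worker_hours: float,
                          base_hours: float = 200_000) -> float:
    """Recordable injuries per base_hours of exposure."""
    return recordables * base_hours / worker_hours

print(injury_frequency_rate(6, 1_000_000))              # 1.2 per 200,000 hours
print(injury_frequency_rate(6, 1_000_000, 1_000_000))   # 6.0 per million hours
```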
Now even though the report refers to TRIR, I’m going to call it TRIFR,
the Total Recordable Injury Frequency Rate, since that’s what we call it in the land of Oz.
What were their methods?
The researchers had direct access to over 3 trillion worker hours of
internally reported incident data from partner organizations within the US,
covering a 15-year period.
The data set included monthly counts of recordable injuries, fatalities and worker hours.
They used a range of statistical analytical approaches,
both parametric and non-parametric tests,
and these served different purposes.
Just very quickly on the parametric approach.
In particular, the Poisson distribution treats recordable incidents as discrete events,
which means that they either occurred or didn’t occur,
and recognises that they vary across individual worker hours.
Therefore, each worker hour was logically represented as a Bernoulli trial,
and the Poisson distribution was deemed the most accurate representation of the resulting counts.
That’s pretty appropriate for modelling these types of discrete events,
in particular discrete, rare events that occur over a fixed interval of exposure,
which describes recordable injuries within worker hours really well.
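To make that concrete, here’s a small simulation sketch of my own, not from the study: each worker hour is treated as a Bernoulli trial with a tiny assumed probability of a recordable, and the resulting monthly counts look just like draws from a Poisson distribution with the same mean.

```python
# Sketch: recordables as Bernoulli trials per worker hour vs a Poisson model.
# The "true" rate and monthly hours below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
p = 2 / 200_000            # assumed chance of a recordable in any one worker hour
hours_per_month = 50_000   # hypothetical monthly exposure for one site

# 10,000 simulated months of independent hour-by-hour Bernoulli trials
bernoulli_counts = rng.binomial(hours_per_month, p, size=10_000)

# Poisson counts with the same expected value (lambda = n * p)
poisson_counts = rng.poisson(hours_per_month * p, size=10_000)

print(bernoulli_counts.mean(), poisson_counts.mean())   # both around 0.5
print(bernoulli_counts.var(), poisson_counts.var())     # variances also around 0.5
```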
They also used some non-parametric approaches,
but I really recommend you check out the report for a description of
what these tests are and why they were used.
The reason why I jumped into that background around the parametric test before,
particularly the Poisson distribution,
is that several people on social media were quick to argue
that the Poisson distribution in this report isn’t appropriate for this type of modelling,
and that the authors should have used a negative binomial distribution.
First of all, some existing research has already shown that incidents
are appropriately modelled via the Poisson distribution,
and that the assumption of independence isn’t actually violated.
I’m not going to go into that here though.
Further, the applied statistics in OHS textbooks from Janicak
specifically describe the appropriateness of Poisson distributions for this purpose,
because the Poisson is well geared for events that are statistically rare compared to the total exposure.
However, a recent 2025 review study in Safety Science on best practices for research into accident under-reporting
found that both Poisson and negative binomial models are appropriate for this task,
and that the two return similar results at medium and large sample sizes.
Negative binomial models did return a higher accuracy in that research,
but again, Poisson was a good approximation of the negative binomial,
and is appropriate for statistically rare events,
which is what the statistician Janicak in his OHS statistics textbooks,
and Hallowell et al. in this current report, argued to begin with.
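If you wanted to sanity-check which model suits your own data, one rough approach is a dispersion check: under a Poisson model the variance of the monthly counts should sit close to the mean, while a variance well above the mean points towards a negative binomial. A quick sketch with made-up counts:

```python
# Sketch: a rough overdispersion check on monthly recordable counts (made-up data).
import numpy as np

monthly_counts = np.array([1, 0, 2, 1, 0, 0, 3, 1, 2, 0, 1, 1])  # hypothetical year

mean = monthly_counts.mean()
var = monthly_counts.var(ddof=1)

print(f"mean={mean:.2f}, variance={var:.2f}, ratio={var / mean:.2f}")
# A ratio near 1 is consistent with Poisson; a ratio well above 1 suggests
# overdispersion, where a negative binomial may fit better.
```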
Anyway, I’ll get off my soapbox now.
So what did they find?
Well, let’s jump into the core findings of this report.
One: no link was found between TRIFR and serious accidents or fatalities.
There was no discernible statistical association between TRIFR and fatalities.
Therefore, trends in TRIFR aren’t statistically associated with fatalities,
suggesting that they happen for different reasons.
And because of this lack of association,
TRIFR is not a proxy for high-impact incidents.
And further, the authors argue that all of these safety activities
associated with improving TRIFR performance may not necessarily help to prevent fatalities.
Two: the results indicated that changes in TRIFR are 96 to 98% due to random variation.
The authors discussed that recordables don’t occur in predictable patterns,
and it’s likely because safety is a complex phenomenon impacted by many different factors.
The models were tested to see if historical TRIFR performance
predicted future TRIFR performance.
They found that at least 100 months of data was needed for reasonable predictive power.
It was argued that because TRIFR is normally used to make monthly or annual comparisons,
this finding, that roughly 100 months of data is the minimum needed,
indicates that for all practical purposes,
TRIFR is not predictive in the way that it’s used.
In plain language, injuries are largely due to chance,
and don’t follow a consistent statistical pattern.
But what does random chance, or chance, mean in this context?
It’s been taken out of context by some practitioners
to mean that there are no underlying causal or contributing factors.
Is that what we’re arguing?
That there are no causes in life?
That’s actually incorrect.
We’re talking about statistical randomness,
not randomness in some ontological or positivistic sense of the word.
What this means is that the occurrence of recordable injuries
doesn’t follow predictable patterns or occur at regular discernible intervals.
So while a safety system aims to reduce risk,
the actual manifestation of incidents over short periods is highly variable,
and statistically behaves like a random process.
There’s also the issue of statistical noise.
Given that 96 to 98% of the variation in TRIFR was due to random variation,
this implies that when you see an organization’s TRIFR go up or down from one period to the next,
it’s overwhelmingly likely to be statistical noise,
rather than the direct reflection of some sort of improvement
or deterioration in the underlying safety system.
You may as well throw dice to predict the next accident,
based on tracking these data.
And because recordables are relatively rare events,
this infrequency contributes to the high degree of random variation within TRIFR,
especially when measured over typical short timeframes of months or years.
So in other words, this doesn’t mean there’s no causality or underlying contributing factors.
It highlights that many causal factors at play in a complex system
lead to recordable incidents that are, at a sort of macro statistical level
over typical reporting periods, highly unpredictable and random.
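Here’s a little illustration of that point, a simulation of my own rather than anything from the paper: hold the underlying risk completely constant and the monthly TRIFR still jumps all over the place from chance alone.

```python
# Sketch: monthly TRIFR noise when the underlying rate never changes.
# The rate and monthly hours are hypothetical.
import numpy as np

rng = np.random.default_rng(7)
true_rate = 1.5            # assumed constant "true" recordables per 200,000 hours
hours_per_month = 80_000   # hypothetical monthly worker hours

lam = true_rate * hours_per_month / 200_000   # expected recordables per month (0.6)
counts = rng.poisson(lam, size=24)            # two years of simulated months
monthly_trifr = counts * 200_000 / hours_per_month

print(monthly_trifr)   # bounces between 0 and ~7.5 with no change in underlying risk
```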
All right, so let’s move on to the next finding, three:
the TRIFR isn’t precise.
It lacks precision and shouldn’t be communicated with multiple decimal points.
Unless hundreds of millions of worker hours are amassed,
the confidence intervals are just too wide
to report TRIFR accurately to even one decimal place.
And perhaps my favorite part of this paper follows:
it shows that if you were to report TRIFR to two decimal places,
for instance a TRIFR of 1.29,
you would need approximately 30 billion worker hours of data to support that claim.
So on this point, the authors state that
TRIFR for almost all companies is virtually meaningless
because they do not accumulate enough worker hours.
Four: TRIFR is statistically invalid for comparisons.
In nearly every practical circumstance, it’s statistically invalid to use
TRIFR to compare companies, business units, projects or teams,
because most companies, again, don’t accumulate enough worker hours
to detect statistically significant differences.
On this point, the authors state that
TRIFR shouldn’t be used to track internal performance or compare companies, etc.
And by extension, TRIFR can’t be reported as a single number,
because if TRIFR is largely random,
a single number doesn’t represent a true reflection of safety performance.
Instead, TRIFR should be expressed as a confidence interval,
a range of potential values over extended periods.
Therefore, single-point estimates of TRIFR broken down to decimal places
over really short periods of time, months to years,
are said to be statistically meaningless for almost every organization.
In plain language, instead of reporting our TRIFR as 1.29,
companies should at least report a range, like ‘our TRIFR is likely between 1 and 4’.
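If you wanted to actually produce that range, one way, assuming the Poisson model the paper leans on, is an exact Poisson confidence interval for the rate. This is a sketch with hypothetical figures, not the paper’s method verbatim.

```python
# Sketch: an exact (Garwood) Poisson confidence interval for a TRIFR,
# assuming recordables follow a Poisson process. Figures are hypothetical.
from scipy.stats import chi2

def trifr_interval(recordables, worker_hours, base_hours=200_000, conf=0.95):
    exposure = worker_hours / base_hours
    alpha = 1 - conf
    lower = 0.0 if recordables == 0 else chi2.ppf(alpha / 2, 2 * recordables) / 2 / exposure
    upper = chi2.ppf(1 - alpha / 2, 2 * (recordables + 1)) / 2 / exposure
    return lower, upper

# Say 6 recordables over about 930,000 worker hours (a point estimate of ~1.29):
print(trifr_interval(6, 930_000))   # roughly (0.5, 2.8) -- far wider than two decimals imply
```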
The next key finding is that TRIFR is only predictive over very long periods,
when you have at least 100 months, more than eight years, of data.
So for practical purposes, given typical month-to-month
or year-to-year reporting periods, it’s not predictive.
So therefore, what your TRIFR was last month or last year
probably doesn’t tell you much about what it’ll be next month or next year,
unless you have many years of data behind it.
And even by the stage when you have enough data,
it may no longer be relevant to your current operations.
Therefore, TRIFR is inadequate for measuring intervention impact.
It’s entirely inadequate for attributing changes in safety to specific interventions or investments.
What can we make of the findings?
Well, first of all, a misconception is that the measures themselves are invalid,
as in we’re saying TRIFR itself is somehow invalid,
but this isn’t actually what the report indicated.
It indicates that the statistical basis of how these measures are typically used is invalid.
Again, using the example of dice: you may as well throw dice.
We’re making causal claims about getting two sixes when, statistically,
we can’t differentiate the sixes from chance.
We assume there’s some causal factor behind why we keep throwing sixes,
but statistically, we can’t demonstrate that.
In any case, some practical implications.
Maybe instead of thinking about abandoning these measures,
some effort to help them suck less is probably warranted.
And the report already gives some suggestions.
Use a range instead of a single-point estimate.
Carve off those decimals: one instead of 1.1.
Decouple the measures from decision-making.
For instance, they say if an organization is using
TRIFR for performance evaluations,
then they’re likely rewarding nothing more than random variation.
Find ways to increase the sample size and hence the statistical power,
the time, the numbers that you’re evaluating, etc.
Don’t use them to track internal performance,
or at least not to compare projects or teams.
By extension, maybe exercise caution when they’re used for gauging contractor performance,
and in tendering.
And, of course, their use in incentive programs.
Also, I like how the researcher David Oswald suggested
coupling quantitative indicators with qualitative indicators.
For instance, every number that you present should have a qualitative descriptor, a narrative.
One gives you the what, that’s the number.
The other gives you the rich narrative on how and why it actually matters.
There’s also other tools to help improve the use of injury measures.
Control charts are a favorite of mine.
I use control charts for lots of stuff,
and they incorporate Poisson and negative binomial thinking.
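As a flavour of what that looks like, here’s a bare-bones u-chart sketch using the standard Poisson-based control limits. The counts and hours are made up, and a real chart would obviously plot the points rather than just print them.

```python
# Sketch: a simple u-chart (Poisson-based control chart) for monthly recordables.
# Centre line and limits follow the standard u-chart formulas; data are made up.
import numpy as np

counts = np.array([2, 1, 0, 3, 1, 2, 0, 1, 4, 1, 2, 1])   # recordables per month
hours = np.full(12, 100_000.0)                             # worker hours per month
units = hours / 200_000                                    # exposure in 200,000-hour units

u = counts / units                                         # monthly TRIFR
u_bar = counts.sum() / units.sum()                         # centre line (pooled rate)
ucl = u_bar + 3 * np.sqrt(u_bar / units)                   # upper control limit
lcl = np.maximum(0, u_bar - 3 * np.sqrt(u_bar / units))    # lower limit, floored at zero

print("monthly TRIFR:", u)
print("signal months:", np.where((u > ucl) | (u < lcl))[0])   # points outside the limits
```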
There’s also a range of statistical methods,
like significance testing, confidence intervals, and more.
Anyway, that’s it on Safe As.
I’m Ben Hutchinson.
Please help share, rate and review,
and check out safetyinsights.org for more research.
Finally, feel free to support Safe As by shouting a coffee.
Link in the show notes.