
This was a real banger – it discussed the NASA Challenger and Columbia investigation reports in the context of social theories of disaster, namely Normal Accident Theory (NAT), and High Reliability Theory (HRT).
It took a critical view of whether it was fair for the Columbia Accident Investigation Board (CAIB) to apply NAT and/or HRT to NASA in hindsight. The authors show “how hindsight biases and selective use of social science theory gave rise to a suggestive and convincing – but not necessarily correct – assessment of NASA’s role in the Columbia space shuttle disaster”.
Moreover, the CAIB “identified NASA’s organizational culture and safety system as a primary source of failure. The CAIB report reads as a stunning indictment of organizational incompetence: the organization that thrilled the world with the Apollo project had ‘lost’ its safety culture and failed to prevent a preventable disaster”. Nevertheless, the authors argue that the “CAIB findings do not sit well with the insights of these [NAT & HRT] schools”.
Can’t do this justice – I recommend you check out the whole paper. I don’t necessarily agree that they have convincingly disputed the CAIB’s findings per se, but I agree with their points about unfair post-hoc comparisons to lofty HRT standards, among others.
Results
They note that, following disasters, social theories are used in public inquiries to make sense of the event. But despite this utility, while the “increased use underlines the relevance of social scientists and their work, it conceals the actual state of the art in this field: these theories have not been tested against a large (or really any) number of cases”
And “Using untested or maybe even faulty theories to assess the performance of organizations and the people that work in them may have serious consequences”.
An example is NASA pre- and post-Columbia disaster, and how the CAIB pointed out the ways NASA had failed. It reported that the causes were much the same as those identified for Challenger 17 years earlier. For instance, the Rogers Commission “criticized NASA for not responding adequately to ‘internal’ warnings about the impending disaster”; issues like “flawed decision-making processes, a lack of safety considerations and bureau-political tensions between the various centers that, together, make up NASA and provide the organizational infrastructure that ultimately allows the Shuttle to fly”.
The CAIB “arrived at such a resolute and damaging assessment using social science theories”.
The CAIB went beyond technical factors, beyond the foam and thermal protection systems, concluding that “NASA’s safety system had failed. CAIB arrived at this far-reaching conclusion by applying ‘high reliability theory’ to analyze the causes of this disaster”.
According to CAIB, NASA’s purported organisational issues bled into:
· An acceptance of escalated risk: NASA operated “what it considered a deeply flawed risk philosophy”, one that prevented NASA from properly investigating anomalies from prior flights; e.g. “with no engineering analysis, Shuttle managers used past success as a justification for future flights”.
· Flawed decision-making: The Rogers Commission criticised NASA’s decision-making system, which “did not flag rising doubts” among the workforce with regard to the safety of the shuttle.
· A broken safety culture: “Both commissions were deeply critical of NASA’s safety culture” and NASA had “‘lost’ its safety program”.
o This was argued to be related to NASA’s susceptibility to schedule pressure and other factors.
Hence, the CAIB “explicitly concluded that NASA had failed to prevent what was judged to be a preventable disaster”. The CAIB compared NASA against the standards of HRT, which, “presumably, would have prevented” or minimised the disaster.
According to these authors, “we show that this theory and the way it was applied cannot support the harsh judgment delivered”.
Explaining organizational disaster: Theories of unruly technology and faltering defenses
The study of organisational disasters has developed over decades, helping to push thinking beyond simplistic mono-causal explanations, like “God’s wrath”, to more complex explanations involving an “interplay between unruly technologies, operator errors, organizational cultures, leadership, and institutional environments”.
Research on socio-technical disasters suggests that it takes “just the right combination of circumstances to produce a catastrophe”, including “a disturbance or glitch in the organization’s core technology, which generates unique task demands for operators and managers that confuse them and confound their abilities to respond effectively”.
The disturbance must also slip past organisational defences and “decision making processes must compound and conceal the initial problem(s) and allow the problem to escalate”.
NAT explains how accidents in complex high-risk technologies are rooted in efforts to “build perfect organizations”. To harness dangerous amounts of energy, organisations must “build rational structures and processes”, but the human capacity to build flawless control mechanisms is limited.
Hence, “Rational structures typically produce unintended consequences, which are not always noticed right away because of the inherent impossibilities (and confusions) that come with information collection, processing, and interpretation”. The roles of interactive complexity and tight coupling are also discussed – but I’ve skipped that discussion.
HRT perspectives are more optimistic than NAT, suggesting that “reliable organizations are set apart from other organizations by a pervasive safety culture, which nurtures a common awareness of potential vulnerabilities (‘it can happen to us’) and a particular way of working”. Together, NAT and HRT offer “seemingly complementary sets of hypotheses (and that is really all they are) with regard to the causes of organizational disaster”.
Using untested theories to explain real disasters: The seducing effect of hindsight bias
They argue that the growing use of social theories in public inquiries – often unproven theories or mere hypotheses – “is obviously problematic from an academic point of view”.
While these concepts help to create a convincing narrative, they also feed hindsight bias. “Once analysts assume that ‘a broken safety culture’ may be the root cause of the disaster at hand, it becomes all too easy to trace that disaster back [to] a combination of strategic oversight, cost pressures, defective equipment and operator error”.
That is, once on this path, a “destructive outcome can easily be made to look self-evident”.
Yet the “root causes uncovered by public inquiries tend to be present in organizations that did not suffer from similar breakdowns under similar circumstances”, and most of the triggers and factors identified in public inquiries are ubiquitous in large organisations.
This hindsight mechanism in explaining disasters “feeds on the widespread idea that an impending disaster is an ontological entity, something ‘out there,’ leaving ‘a repeated trail of early warning signals’”. Since these early warning signs must surely have been there – how did NASA miss them? If only it had paid attention.
Instead, “Organizations are not entities that can ‘pay attention.’ They are a mosaic of elements that interact together and generate fractures within and between such interactions”. It’s these interactions that precede disaster and create ‘organisational deafness’.
They argue that the use of a “single unproven theory”, applied to a major accident without a control group, isn’t a sound basis for analysis. Use of HRT, for instance, leads to a “one-sided conclusion: it was not the unruly technology that comes with ‘one of the most complex machines ever devised’ (CAIB, 2003: 14), but ‘a broken safety culture’ that caused this disaster”.
They provide an alternative explanation for the findings [I haven’t done a thorough job covering this]. In any case, their explanation “is not the story of what ‘really happened.’ It is a demonstration of how the same social science theories that were available to CAIB can lead to a fundamentally different assessment”.
Unruly technology, pressing constraints, and an unforgiving environment: NASA’s safety system revisited
CAIB argued that NASA could have prevented the Columbia disaster if only it had been a High Reliability Organisation. From the NAT perspective, however, “such an assertion misses an important point: humans cannot control dangerous technologies through their imperfect organizations”.
Indeed, human efforts to build perfect systems will probably create “unforeseen vulnerabilities”. The best an organisation can do, then, is to minimise known risks and “create sufficient capacity to deal with emerging ‘unknown unknowns’”.
NASA was also criticised over its risk-acceptance practices, even though NASA was expected to take responsible risks. Given the limits of quantitative probabilities, “NASA rejected the verisimilitude of quantitative risk analysis and simply accepted that every space flight could end in disaster”.
This philosophy demanded a focus on commitment to sound engineering principles and a powerful culture around expertise. For NASA, this was a philosophy of calculated risk.
A philosophy of calculated risk
Strict rules had always governed the lead-up to launch. Identified issues had to be closed before the next flight could take place and “All judgments were [made] on the basis of engineering arguments only” (emphasis added).
Rules served several purposes: to prescribe best practices, to enhance central control, and to protect the organisation and its individuals from external criticism. In the case of Apollo 13, procedures allowed engineers to figure out what had happened, yet “it was the capacity to be flexible and to depart from enshrined rules that gave rise to the level of improvisation that in the end saved the day (and the crew)”.
NASA relied on a process of acceptable risks; yet the Rogers Commission saw the idea of “acceptable risk” as unacceptable. In the authors’ view, the Commission failed to recognise that NASA’s reliance on acceptable risk wasn’t reckless, but “firmly entrenched in the NASA culture, because it had proved its worth in the Apollo years”.
Also, all discussions on the shuttle were held “on the basis of engineering logic; every flight risk and anomaly is assessed against the laws of physics and engineering”; a type of scientific positivism. In the Flight Readiness Reviews, “there is no room for ‘gut feeling’ or ‘observations,’ only solid engineering data are admissible” (emphasis added).
This was seen to be, ultimately, a blindspot for NASA.
NASA’s blind spot: Critical decision-making and intuition
The inquiries revealed that some engineers had voiced concerns prior to Columbia and Challenger. NASA did not recklessly ignore warnings, “but abided by its safety system (the risk procedure and the FRR)”.
As the world found out, though, these processes were not perfect.
Importantly, the authors argue that “NASA had no proper procedures to identify and properly weigh signals of doubts, coming from respected engineers, which were not substantiated by engineering data”.
Hence, in hindsight, it’s obvious that “NASA did not know how to deal with ambiguously communicated ‘gut feelings’ of its own engineers”.
They highlight that initial assessments that circulated between NASA engineers and contractors didn’t result in alarm, and “may have contributed to a mindset that [the foam hit] was not a concern”.
The absence of ‘hard data’, resulting from the lack of imagery, made it “nearly impossible to jumpstart the discussion”. And while the CAIB pointed out that the uncertainties were noted in a presentation prior to re-entry, the “NASA culture did not allow for ‘feelings’ and ‘observations’”.
Conclusion
The authors then tie together their arguments (if you can’t follow their line of argument, that’s likely because I didn’t do a good or thorough job on this paper).
They argue that:
· Via use of social science insights, the “CAIB report paints a bleak picture of an organization that cuts safety corners, ignores clear-cut warnings, suppresses whistle-blowing engineers, and does everything it can to beat irresponsible deadlines”
· First, they argue that the theory on which the CAIB leaned so heavily “does not support the Board’s conclusions”
· Moreover, the Board “applied a highly selective and rather simplified version of HRT to assess NASA’s safety system”
· Second, they argue that leveraging NAT and HRT “leads to a more ambiguous explanation of the shuttle disasters”
· While NASA has had a few “stunning disasters”, they’ve mostly had “spectacular successes”, and, in their view, suggesting that NASA had “lost its safety culture” since Apollo “is misleading”
· Instead, “The essence of NASA’s culture had, in fact, not changed”, but, of course, it wasn’t perfect
· NASA did have blind spots, which in hindsight, played crucial roles in the pre-disaster phases of Challenger and Columbia
· Moreover, “In the institutionalization of its safety culture, NASA seems to have lost some of its ability to recognize significant emerging events”
· Further, NASA has “always scoffed at making judgment based on soft data, but the organization also used to have ‘institutional recalcitrants’ who could do just that: so-called ‘intuitive engineers’ who were respected as brilliant and whose judgments would be heard”
· In their view, this is where NASA’s approach failed it most: NASA “did not recognize the deep uncertainty that cannot be captured or explained by sound engineering logic. This system is bound to ‘miss signals’”
· Hence, the core challenge is to enable “structural room for indeterminacies, to value the ‘disconnects’ in risky, rational systems”
Authors: Boin, A., & Fishbacher-Smith, D. (2011). The importance of failure theories in assessing crisis management: The Columbia space shuttle disaster revisited. Policy and Society, 30(2), 77-87.
Study link: https://academic.oup.com/policyandsociety/article-pdf/30/2/77/42621824/j.polsoc.2011.03.003.pdf
My site with more reviews: https://safety177496371.wordpress.com
LinkedIn post: https://www.linkedin.com/pulse/importance-failure-theories-assessing-crisis-columbia-ben-hutchinson-zqinc