Large language models powered system safety assessment: applying STPA and FRAM

An AI, STPA and FRAM walk into a bar…ok, that’s all I’ve got.

This study used ChatGPT-4o and Gemini to apply STPA and FRAM to analyse:

“liquid hydrogen (LH2) aircraft refuelling process, which is not a well-known process, that presents unique challenges in hazard identification”.

It is one of several studies applying LLMs to safety analysis (FRAM, STPA and more).

Some extracts:

· “both LLMs demonstrated weaknesses in their analyses, with ChatGPT generally outperforming Gemini regarding response comprehensiveness and adhering to the prompted format”

· “LLMs failed to use systems thinking in their stand-alone applications and failed to follow up on previous prompt outputs”

· “While LLMs can provide substantial amounts of information quickly, the effectiveness of LLMs in system safety assessment is contingent on addressing their limitations and implementing strategies to improve their capabilities”

· “Overall … ChatGPT performed better than Gemini regarding following FRAM and STPA steps”

· “Gemini’s architecture focuses on short, direct and general responses due to its emphasis on factual precision and brevity. In contrast, ChatGPT tends to be more verbose and provides greater detail, with a better capability to provide tailored responses”

· “Our findings revealed that while Gemini provided greater details on some FRAM analysis tasks, ChatGPT provided a more comprehensive and detailed analysis of most STPA outputs”

· “both [LLMs] failed in identifying all couplings between functions … These missing interdependencies illustrate a core limitation in the models’ ability to adopt a systems-thinking approach, which requires understanding how functions interact dynamically rather than operating in isolation”

· “This suggests that LLM chatbots might perform better with traditional tools, as the existing literature and practice are predominantly based on traditional safety assessment tools and techniques. Consequently, LLM will prioritise the established traditional thinking and application”

· “However, Sujan et al. (2025) found that the model outputs could consider the interactions between functions and consider systems thinking when prompted accordingly”

· “With FRAM applications, it is crucial to analyse how things could go right and identify potential trade-offs made by the operators to ensure safe operations. In this study, both models failed to reveal those trade-offs”

· “However, it should be acknowledged that the case study was challenging as there is no everyday practice on refuelling of LH2-powered aircraft. Yet, ChatGPT could still generate variability examples of being on time, acceptable, and precise, which leads to considerations of how things could go right”
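
To make the "couplings between functions" extract a bit more concrete, here is a minimal Python sketch of how FRAM couplings can be enumerated once each function is written down by its aspects. The three functions and their labels are invented for illustration and are not taken from the paper; the only thing borrowed from FRAM is the six-aspect structure (Input, Output, Precondition, Resource, Control, Time), where a coupling exists when one function's Output is named in another function's aspect.

```python
from dataclasses import dataclass, field

# The five aspects through which a function can receive another function's Output.
RECEIVING_ASPECTS = ("input", "precondition", "resource", "control", "time")


@dataclass
class Function:
    """A FRAM function described by its six aspects (each a set of labels)."""
    name: str
    output: set = field(default_factory=set)
    input: set = field(default_factory=set)
    precondition: set = field(default_factory=set)
    resource: set = field(default_factory=set)
    control: set = field(default_factory=set)
    time: set = field(default_factory=set)


def find_couplings(functions):
    """Return (upstream, label, downstream, aspect) for every shared label."""
    couplings = []
    for upstream in functions:
        for downstream in functions:
            if upstream is downstream:
                continue
            for aspect in RECEIVING_ASPECTS:
                shared = upstream.output & getattr(downstream, aspect)
                for label in sorted(shared):
                    couplings.append((upstream.name, label, downstream.name, aspect))
    return couplings


# Hypothetical functions, made up for this example only.
functions = [
    Function("Chill transfer line", output={"line chilled"}),
    Function(
        "Transfer LH2",
        output={"tank filling"},
        precondition={"line chilled"},
        control={"pressure reading"},
    ),
    Function(
        "Monitor tank pressure",
        output={"pressure reading"},
        input={"tank filling"},
    ),
]

couplings = find_couplings(functions)
for upstream, label, downstream, aspect in couplings:
    print(f"{upstream} --[{label}]--> {downstream} ({aspect})")
```

A reviewer could compare a list like this against the couplings an LLM reports to see which interdependencies it missed. That is only a mechanical completeness check and does not replace the judgement about variability and trade-offs that the extracts describe.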