
From transcript to insights: summarizing safety culture interviews with LLMs
How well does OpenAI o1 work for summarising ‘safety culture’ interviews, and how does it compare to human notes?
This study put those questions to the test.
Extracts:
· They assessed correctness via exhaustiveness (comparing LLM claims against human interviewer notes), consistency (comparing LLM claims between subsequent reports for the same organisation), and discriminant capacity (comparing claims between different organisations); a minimal sketch of how such overlap percentages can be computed appears after this list
· For initial summaries of interviews, there was a “near perfect overlap with the notes of the interviewers (95%)”
· Overlap for the final synthesis “model report (step 2) with the human-based reports was lower, with 64.3%”
· The generated LLM reports were effective for insight discrimination, and “capable of discriminating between organisations with only an overlap of 27.4% in claims”
· The outputs were consistent, being “relatively consistent with 75.7% overlap in claims between two reports generated for the same organisation”
· A “relatively high percentage of hallucinations was found (2 out of 34 claims, 5.9%)”
· And notably, these specific “‘hallucinations’ did not contain falsehoods… instead they concerned an interpretation or conclusion based on previous claims generated for that theme”
· Not surprisingly, they conclude that despite the utility, “LLMs primarily assist and should not replace researchers”
· “It is not surprising that some overlap was found between the model reports of Organisation A and B. It shows that the model reports primarily consisted of organisation specific claims, while some claims are likely to be encountered in any organisation (e.g. Leadership is both visible and approachable on the work floor)”
· “Similarly, the result in relation to the consistency shows that the o1-model will replicate the majority of the claims when repeating the analysis. It is not surprising that the model reports show deviations”
· “Human researchers would also develop different reports when asked to analyse a set of transcripts twice (with no memory of the other analysis)”
· “It should be noted here that an interesting challenge in interpreting these results is that the objective truth (i.e., what is the best qualitative summary of the interviews that represents the true situation at the organisation) is actually unknown. Here we worked with the assumption that the interpretation of the interviewers is the correct one, and deviations of the o1-model from their interpretations would then be problematic”
· “To leverage the differences between human and AI-driven analysis as a strength rather than a limitation, a mixed-methods approach is recommended in which both sources are treated as equally valid”
· “By systematically examining the overlap and discrepancies between the findings of the o1-model and those of human coders, blind spots can be uncovered as well as complementary insights”
· “This contrastive analysis can serve as a reflective tool, where findings identified by only one method are revisited in the raw data or discussed with subject matter experts (not involved in the project) or the interviewees themselves”
· Returning to the hallucination figure (2 out of 34 claims): “even though the presence of hallucinations is problematic, they are not per se detrimental to the reliability of the generated output”
· “Especially if the interviewers act as a final barrier to avoid such hallucinations from making it into the final report”
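
To make the three correctness measures concrete, here is a minimal Python sketch of how claim-overlap percentages like those above could be computed. This is an illustration under stated assumptions, not the authors' method: the paper does not publish code, claim equivalence in the study was presumably judged semantically against interviewer notes rather than by the exact string matching used here, and the function name (overlap_pct) and all claim sets below (apart from the leadership example quoted from the paper) are hypothetical.

```python
# Illustrative only: claim matching here is exact string equality,
# whereas the study presumably judged semantic equivalence by hand.

def overlap_pct(claims_a, claims_b):
    """Percentage of claims in claims_a that also appear in claims_b."""
    if not claims_a:
        return 0.0
    return 100.0 * len(set(claims_a) & set(claims_b)) / len(set(claims_a))

# Hypothetical claim sets for demonstration
llm_claims = {
    "Leadership is both visible and approachable on the work floor",
    "Incident reporting is encouraged but follow-up is slow",
}
interviewer_notes = {
    "Leadership is both visible and approachable on the work floor",
    "Incident reporting is encouraged but follow-up is slow",
}

# Exhaustiveness: LLM claims vs human interviewer notes
print(f"Exhaustiveness: {overlap_pct(llm_claims, interviewer_notes):.1f}%")

# Consistency: two reports generated for the same organisation
run_1 = llm_claims
run_2 = {"Leadership is both visible and approachable on the work floor"}
print(f"Consistency: {overlap_pct(run_1, run_2):.1f}%")

# Discriminant capacity: reports for two different organisations
# (lower overlap = better discrimination between organisations)
org_a = llm_claims
org_b = {
    "Leadership is both visible and approachable on the work floor",
    "PPE compliance varies strongly between shifts",
}
print(f"Cross-organisation overlap: {overlap_pct(org_a, org_b):.1f}%")
```

The same set operations would support the contrastive analysis the authors recommend: the symmetric difference (e.g. llm_claims ^ interviewer_notes) surfaces findings identified by only one method, which can then be revisited in the raw data or discussed with subject matter experts.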

Ref: Steijn, W., van de Loo, J., van der Beek, D., & Groeneweg, J. (2025). From transcript to insights: Summarizing safety culture interviews with LLMs. MATEC Web of Conferences, MAIQS 2025.

Shout me a coffee (one-off or monthly recurring)
Study link: https://www.matec-conferences.org/articles/matecconf/pdf/2025/07/matecconf_maiqs2025_03003.pdf
Safe As LinkedIn group: https://www.linkedin.com/groups/14717868/