Improving Construction Site Safety with Large Language Models: A Performance Analysis

2nd Mar 2026

This preliminary, proof-of-concept study explored the effectiveness of GPT-4o in visual hazard recognition on construction sites. The authors contrasted its performance against that of OHS experts.

The source material was **static images** from Google and real construction sites (not real-time video analysis).

The LLM & experts were asked to rate the hazard, justify their judgement, and assess the immediate issues, use of PPE, and compliance with regulations.

Shared under open access.


Extracts:

“results indicate that the model can serve as a valuable decision-support tool for safety professionals by providing scalable and real-time insights”
“With an overall accuracy of approximately 69% and sensitivity near 70%, the model shows a moderate capability in identifying hazardous situations, even without domain-specific fine-tuning”
“However, the study also highlights key limitations, including the model’s reliance on general visual features rather than domain-specific safety knowledge, and the continued need for human supervision”
“Additionally, ethical concerns, including bias in AI-generated hazard assessments, data privacy, and the risk of over-reliance on AI, must be carefully managed to ensure these tools contribute responsibly and effectively to proactive risk management strategies”
“it is evident that [human] experts possess a comprehensive understanding of regulations, technical requirements, and potential hazards that are not always evident through superficial observation”
“For example, the failure of the crane hook or the potential for a wooden block to slip—due to e.g., subtle ground slopes, micro-vibrations, or suboptimal support positioning” etc.
“In contrast, GPT-4o primarily relies on the visual information available. As a result, the model seems unable to infer the risks that professionals can suggest from their extensive experience and the detailed analysis of each phase of the work process”
“Consequently, the model tends to emphasize reassuring elements—such as the presence of PPE—while overlooking less obvious yet potentially hazardous aspects (i.e., false negative cases)” [** though this is likely amenable to improvement with more precise prompting and constraints]
“Moreover, GPT-4o does not account for industry-specific regulations, leading to an analysis that is more superficial, optimistic, and limited to the explicitly provided information, as aforementioned” [** again, as above]
“One of GPT-4o’s key strengths lies in its ability to rapidly process and analyze large volumes of data, including images and textual descriptions. This capability makes it particularly valuable in complex work environments such as construction sites, where the timely identification of potential hazards is crucial”
The authors discuss the ethical considerations of using such systems, particularly real-time video systems.
As expected, they argue that these tools should enhance human expertise rather than replace it, and that they could be integrated with real-time video monitoring and the like.
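
For readers less familiar with the metrics quoted above, here is a minimal sketch of how accuracy, sensitivity, and the false negative rate relate to each other. The counts are hypothetical round numbers I picked so the arithmetic lands near the reported ~69% accuracy and ~70% sensitivity; they are not the study's data, and the class and function names are my own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary hazardous / not-hazardous classification."""
    tp: int  # hazardous images correctly flagged as hazardous
    fn: int  # hazardous images missed (false negatives)
    tn: int  # safe images correctly rated as safe
    fp: int  # safe images wrongly flagged as hazardous


def accuracy(c: ConfusionCounts) -> float:
    """Share of all images that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def sensitivity(c: ConfusionCounts) -> float:
    """Share of truly hazardous images that were detected (recall)."""
    return c.tp / (c.tp + c.fn)


def false_negative_rate(c: ConfusionCounts) -> float:
    """Share of truly hazardous images that were missed."""
    return c.fn / (c.tp + c.fn)


# HYPOTHETICAL counts for illustration only, not taken from the paper:
# 100 hazardous and 100 non-hazardous images.
example = ConfusionCounts(tp=70, fn=30, tn=68, fp=32)

print(f"accuracy:            {accuracy(example):.2f}")
print(f"sensitivity:         {sensitivity(example):.2f}")
print(f"false negative rate: {false_negative_rate(example):.2f}")
```

The point to notice is that sensitivity and the false negative rate always sum to 1. A sensitivity near 70% therefore means roughly three in ten hazardous scenes would be missed, which is the false negative problem the authors flag.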

information-17-00210-v2 Download

Published by Ben Hutchinson


SafetyInsights.org

Home of safety & risk research summaries
