
This study explored how trustworthy large language models (LLMs) are in mental health conversations, looking at attributes like giving guidance, showing empathy and being safe.
The authors built two large datasets: one with real therapy conversations to test how LLMs respond, and another with expert ratings of those responses.
Extracts:
· "Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance"
· "Guidance and Informativeness achieve excellent consistency"
· "Affective attributes show good consistency but reduced precision"
· "Empathy and Helpfulness… exhibit wider CI and poor absolute agreement"
· "Cognitive attributes show modest systematic bias patterns… amenable to calibration correction"
· "GPT-4o achieved the highest score (4.76), followed by Gemini-2.0-Flash (4.65) and GPT-4o-Mini"
· "We explicitly caution against the clinical deployment of these systems without human oversight"
· "Professional judgment remains essential"
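
For anyone unsure what "consistency", "absolute agreement" and "calibration correction" refer to in the extracts above, here is a minimal sketch (not the paper's code, and using made-up numbers): it estimates an LLM judge's systematic bias against hypothetical expert ratings on a 1–5 scale, reports a 95% CI, and applies a simple mean-bias correction.

```python
import numpy as np

# Hypothetical data: 200 responses rated 1-5 by experts, plus an LLM judge
# that tends to score ~0.3 points higher (the kind of inflation the paper reports).
rng = np.random.default_rng(0)
expert = rng.integers(1, 6, size=200).astype(float)
judge = np.clip(expert + rng.normal(0.3, 0.6, size=200), 1, 5)

# Systematic bias (inflation) and its 95% confidence interval.
bias = judge - expert
mean_bias = bias.mean()
se = bias.std(ddof=1) / np.sqrt(len(bias))
ci_low, ci_high = mean_bias - 1.96 * se, mean_bias + 1.96 * se

# Consistency (relative ordering) vs. absolute agreement (exact values).
consistency = np.corrcoef(expert, judge)[0, 1]
mean_abs_error = np.abs(bias).mean()

# A simple calibration correction: subtract the estimated bias.
calibrated = np.clip(judge - mean_bias, 1, 5)

print(f"bias: {mean_bias:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"consistency r = {consistency:.2f}, mean abs. error = {mean_abs_error:.2f}")
print(f"mean abs. error after calibration = {np.abs(calibrated - expert).mean():.2f}")
```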

Ref: Badawi, A., Rahimi, E., Laskar, M. T. R., Grach, S., Bertrand, L., Danok, L., Huang, J., Rudzicz, F., & Dolatabadi, E. (2025). When can we trust LLMs in mental health? Large-scale benchmarks for reliable LLM evaluation. arXiv preprint arXiv:2510.19032.

Study link: https://arxiv.org/pdf/2510.19032