When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation

This study explored how trustworthy large language models (LLMs) are when used in mental health conversations: giving advice, showing empathy, and staying safe.

They built two large datasets: one with real therapy convos to test how the models respond, and another with expert ratings of those responses.

Extracts:

·        “Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance”

·        “Guidance and Informativeness achieve excellent consistency”

·        “Affective attributes show good consistency but reduced precision”

·        “Empathy and Helpfulness… exhibit wider CI and poor absolute agreement”

·        “Cognitive attributes show modest systematic bias patterns… amenable to calibration correction” (see the sketch after these extracts)

·        “GPT-4o achieved the highest score (4.76), followed by Gemini-2.0-Flash (4.65) and GPT-4o-Mini”

·        “We explicitly caution against the clinical deployment of these systems without human oversight”

·        “Professional judgment remains essential”

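The “systematic inflation” and “calibration correction” points come down to comparing LLM-judge scores with expert scores on the same responses. Below is a minimal illustrative sketch, not the paper’s actual pipeline, of how one might measure that bias and consistency and apply a simple offset correction; the toy ratings, attribute scale, and helper function are hypothetical.

```python
# Illustrative sketch only: toy numbers and names are made up, not from the paper.
# Idea: for one attribute, compare LLM-judge scores to expert scores on the same
# items, report mean bias (inflation) and correlation (consistency), then apply a
# naive mean-offset correction as one simple form of "calibration correction".

from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for a handful of responses on one attribute.
expert = [3, 4, 2, 5, 3, 4]
llm_judge = [4, 5, 3, 5, 4, 4]   # note the upward drift vs. the experts

bias = mean(llm_judge) - mean(expert)          # systematic inflation
consistency = pearson(llm_judge, expert)       # do they rank items similarly?
calibrated = [s - bias for s in llm_judge]     # naive mean-offset correction

print(f"mean bias (judge - expert): {bias:+.2f}")
print(f"Pearson r (consistency):    {consistency:.2f}")
print(f"calibrated judge scores:    {[round(s, 2) for s in calibrated]}")
```

The paper reports confidence intervals and absolute agreement, which point to more formal reliability statistics than this toy example, but the logic is the same: high consistency with a steady offset (as for the cognitive attributes) is fixable by calibration, whereas wide CIs and poor absolute agreement (as reported for Empathy and Helpfulness) are not.
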
Ref: Badawi, A., Rahimi, E., Laskar, M. T. R., Grach, S., Bertrand, L., Danok, L., Huang, J., Rudzicz, F., & Dolatabadi, E. (2025). When can we trust LLMs in mental health? Large-scale benchmarks for reliable LLM evaluation. arXiv preprint arXiv:2510.19032.


Shout me a coffee (one-off or monthly recurring)

Study link: https://arxiv.org/pdf/2510.19032

LinkedIn post: https://www.linkedin.com/posts/benhutchinson2_this-study-explored-how-trustworthy-ai-llm-activity-7388388350262505472-bLQt?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAeWwekBvsvDLB8o-zfeeLOQ66VbGXbOpJU
