LLMs Are Not Reliable Human Proxies to Study Affordances in Data Visualizations

This was pretty interesting – it compared GPT-4o to people in extracting takeaways from visualised data.

They were also interested in how well the LLM could simulate human respondents/responses.

Note that the researchers are primarily interested in whether the GPT-4o model acts as a suitable proxy for human responses – they recognise that LLMs have other benefits that don’t involve simulating people.
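
To make the setup concrete: collecting an LLM “takeaway” amounts to sending the chart image to the model and asking for a reader-style response. Below is a minimal sketch using the OpenAI Python SDK; the prompt wording and file name are my assumptions, not the study’s exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chart_takeaway(image_path: str) -> str:
    """Ask GPT-4o for one reader-style takeaway from a chart image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                # Hypothetical prompt; the paper's exact wording is not reproduced here
                {"type": "text",
                 "text": "State the main takeaway a reader would draw from this chart."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(chart_takeaway("chart.png"))  # hypothetical file name
```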

Background:

· Design choices in data visualisations are important, as they shape how people interpret the data

· E.g. “bar charts encourage readers to make magnitude comparisons, such as ‘A is larger than B,’ while a line graph highlights trends and changes over time, such as ‘A is increasing at a higher rate than B’” (sketched in code after this list)

· “Visualizations that aggregate data points, including bar charts, can lead readers to infer causality, whereas those that display probabilistic outcomes, such as scatterplots, promote better understanding of uncertainty”

· “Design choices also influence decisions that readers make … For example, in an investigation on representations of wildfire risk, researchers found that icon arrays with a small number of icons resulted in distinct decision making patterns compared to numerical representations and icon arrays with a large number of icons”
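
To see the bar-vs-line affordance difference concretely, here is a minimal matplotlib sketch (the numbers are made up): the same series is drawn both ways, and each encoding invites a different kind of takeaway.

```python
import matplotlib.pyplot as plt

# Illustrative data: the same series rendered two ways
labels = ["2019", "2020", "2021", "2022"]
values = [12, 18, 15, 24]

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: invites magnitude comparisons ("2022 is larger than 2021")
ax_bar.bar(labels, values)
ax_bar.set_title("Bar: magnitude comparison")

# Line chart: invites trend reading ("values rise over time")
ax_line.plot(labels, values, marker="o")
ax_line.set_title("Line: trend over time")

plt.tight_layout()
plt.show()
```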

Results:

· “Overall, humans produced more takeaways that accurately described information from the charts compared to GPT-4o”

· While “GPT-4o produced lengthier responses on average, variation in length was greater in human responses”

· “Humans Outperform GPT-4o In Terms of Accuracy”: almost every human response accurately described the presented charts (96.6%), versus 63.4% for GPT-4o

· “Human Takeaways Contain Fewer Factors than Those Produced by GPT-4o”

· There was partial alignment between the factors identified by people and GPT-4o

· For people, “the most common factor was small trends, followed by clusters. On the other hand, the most common factor in GPT-4o takeaways was points, followed by small trends.”

· “Given these limitations, we caution against the use of AI as a proxy for human participants in visualization studies”

· However, they “acknowledge that approaches involving alternative model preconditioning [18], strategic prompting methods [31, 35], and the use of multiple LLMs [34] may enhance LLM performance to be closer to that of humans”

· “While GPT-4o exhibited some degree of overlap with human responses, such as the common identification of small trends in dot plots and line charts, its high error rate, lack of semantic diversity, and failure to completely align with human affordances result in an unreliable substitute for human participants in visualization studies” (one generic way to quantify such semantic overlap is sketched after this list)
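
The paper doesn’t describe how “semantic diversity” was computed, so purely as my own assumption, here is one generic way to proxy it with sentence embeddings (using sentence-transformers; the example takeaways are invented). Higher average pairwise similarity within a group means less diverse responses.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical takeaway texts; the study's actual data is not reproduced here
human = ["Sales dip in 2021 then recover.",
         "There is a small upward trend overall."]
llm = ["The highest point is 24 in 2022.",
       "The maximum value occurs in 2022."]

model = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_similarity(texts):
    """Average cosine similarity over all pairs; higher = less diverse."""
    emb = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    n = len(texts)
    # Exclude the diagonal of self-similarities (all 1.0)
    return (sims.sum() - n) / (n * (n - 1))

print("human diversity proxy:", mean_pairwise_similarity(human).item())
print("LLM diversity proxy:", mean_pairwise_similarity(llm).item())
```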

Ref: Lin, K., Stokes, C., & Bearfield, C. X. (2025). LLMs Are Not Reliable Human Proxies to Study Affordances in Data Visualizations. 2nd HEAL Workshop at CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.

Study link: https://heal-workshop.github.io/chi2025_papers/26_LLMs_Are_Not_Reliable_Human.pdf

LinkedIn post: https://www.linkedin.com/posts/benhutchinson2_this-was-pretty-interesting-it-compared-activity-7326067207472381952-x7Ed?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAeWwekBvsvDLB8o-zfeeLOQ66VbGXbOpJU
