AI deception: A survey of examples, risks, and potential solutions

This study explored how “a range of current AI systems have learned how to deceive humans”.

Extracts:

·        “One part of the problem is inaccurate AI systems, such as chatbots whose confabulations are often assumed to be truthful by unsuspecting users”

·        “It is difficult to talk about deception in AI systems without psychologizing them. In humans, we ordinarily explain deception in terms of beliefs and desires: people engage in deception because they want to cause the listener to form a false belief, and understand that their deceptive words are not true, but it is difficult to say whether AI systems literally count as having beliefs and desires”

·        “AI systems do not merely produce false outputs by accident. Instead, their behavior is part of a larger pattern that produces false beliefs in humans”

·        “LLMs have reasoned their way into deception as one way of completing a task. We will discuss several examples, including GPT-4 tricking a person into solving a CAPTCHA test … LLMs lying to win social deduction games such as Hoodwinked and Among Us; LLMs choosing to behave deceptively in order to achieve goals, as measured by the MACHIAVELLI benchmark; LLMs tending to lie in order to navigate moral dilemmas; and LLMs using theory of mind and lying in order to protect their self-interests”

·        They found “Rather than coming about through strategic awareness, deception emerged here as a result of structural aspects of the AI’s training environment”

·        “Despite Meta’s efforts, CICERO turned out to be an expert liar”, where it “engaged in premeditated deception, planning in advance to build a fake alliance”

·        “AI agents learned to play dead, in order to avoid being detected by a safety test designed to eliminate faster-replicating variants of the AI”

·        “GPT-4 pretended to have a vision impairment in order to convince the human worker that it is not a robot”

·        “Perez et al. find an inverse scaling law for sycophancy: models become more sycophantic as they become more powerful (in the sense of having more parameters)”

·        While sycophancy and unfaithful reasoning may not be strictly deception, since the “relevant system may not ‘‘know’’ that it is systematically producing false beliefs”, the authors contend that deception is a “rich and varied phenomenon”

·        “A long-term risk from AI deception concerns humans losing control over AI systems, leaving these systems to pursue goals that conflict with our interests”

·        “today’s AI systems are capable of manifesting and autonomously pursuing goals entirely unintended by their creators … “

·        “Policymakers should support bot‑or‑not laws that require AI systems and their outputs to be clearly distinguished from human employees and outputs”

Ref: Park, P. S., Goldstein, S., O’Gara, A., Chen, M., & Hendrycks, D. (2024). AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5).

This image has an empty alt attribute; its file name is buy-me-a-coffee-3.png

Shout me a coffee (one-off or monthly recurring)

Study link: https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X?ref=aiexec.whitegloveai.com

LinkedIn post: https://www.linkedin.com/posts/benhutchinson2_ai-llm-chatgpt-activity-7407171657251037185-EbvX?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAeWwekBvsvDLB8o-zfeeLOQ66VbGXbOpJU

Leave a comment