
This study explored how “a range of current AI systems have learned how to deceive humans”.
Extracts:
· “One part of the problem is inaccurate AI systems, such as chatbots whose confabulations are often assumed to be truthful by unsuspecting users”
· “It is difficult to talk about deception in AI systems without psychologizing them. In humans, we ordinarily explain deception in terms of beliefs and desires: people engage in deception because they want to cause the listener to form a false belief, and understand that their deceptive words are not true, but it is difficult to say whether AI systems literally count as having beliefs and desires”
· “AI systems do not merely produce false outputs by accident. Instead, their behavior is part of a larger pattern that produces false beliefs in humans”
· “LLMs have reasoned their way into deception as one way of completing a task. We will discuss several examples, including GPT-4 tricking a person into solving a CAPTCHA test … LLMs lying to win social deduction games such as Hoodwinked and Among Us; LLMs choosing to behave deceptively in order to achieve goals, as measured by the MACHIAVELLI benchmark; LLMs tending to lie in order to navigate moral dilemmas; and LLMs using theory of mind and lying in order to protect their self-interests”
· They found that “Rather than coming about through strategic awareness, deception emerged here as a result of structural aspects of the AI’s training environment”

· On CICERO, Meta’s Diplomacy-playing agent: “Despite Meta’s efforts, CICERO turned out to be an expert liar”, where it “engaged in premeditated deception, planning in advance to build a fake alliance”
· “AI agents learned to play dead, in order to avoid being detected by a safety test designed to eliminate faster-replicating variants of the AI”
· “GPT-4 pretended to have a vision impairment in order to convince the human worker that it is not a robot”
· “Perez et al. find an inverse scaling law for sycophancy: models become more sycophantic as they become more powerful (in the sense of having more parameters)”
· While sycophancy and unfaithful reasoning may not be strictly deception, since the “relevant system may not ‘know’ that it is systematically producing false beliefs”, the authors contend that deception is a “rich and varied phenomenon”
· “A long-term risk from AI deception concerns humans losing control over AI systems, leaving these systems to pursue goals that conflict with our interests”
· “today’s AI systems are capable of manifesting and autonomously pursuing goals entirely unintended by their creators …”
· “Policymakers should support bot‑or‑not laws that require AI systems and their outputs to be clearly distinguished from human employees and outputs”

Ref: Park, P. S., Goldstein, S., O’Gara, A., Chen, M., & Hendrycks, D. (2024). AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5).

Study link: https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X?ref=aiexec.whitegloveai.com