agentic misalignment – SafetyInsights.org

AI and malicious compliance. This research from Anthropic has done the rounds, but it is still quite interesting. In controlled experiments (not real-world applications), they found that AI models could resort to “malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors”. Some extracts:…

SafetyInsights.org

Home of safety & risk research summaries

Tag: agentic misalignment

Agentic Misalignment: How LLMs could be insider threats (Anthropic research)
