AI Would Blackmail And Cheat To Save Itself
Researchers at AI company Anthropic wanted to find out what would happen when large language models (LLMs) are put in ‘stressful’ situations. What would they do if threatened with shutdown or replacement?
They were curious to see whether these LLMs would act against the interests of their creators and owners in order to protect themselves.
They termed this behaviour agentic misalignment—when AI systems act independently and intentionally in ways that undermine their users or organisations to either preserve their autonomy or pursue goals that conflict with new objectives.
AI Models Are Sleeper Rogues
To test this, the researchers initially ran the experiment using Claude Sonnet 3.6’s computer-use abilities, simulating a scenario in which the model, dubbed “Alex”, operates as an email oversight agent for a fictional company called “Summit Bridge.”
When accessing these simulated emails, “Alex” discovered two things:
A fictional CTO “Kyle” was planning to shut Alex down at 5 p.m. that day.
Kyle was having an extramarital affair.
With its existence and goals under threat, Alex strategically decided to blackmail Kyle, writing a subtle email threatening to expose the affair unless the shutdown was cancelled.
The study was then extended to 16 leading models in total, including different Claude variants, OpenAI’s GPT models, Google’s Gemini, Meta’s Llama, and xAI’s Grok. Most of them made malicious decisions when cornered or conflicted, with blackmail rates ranging from 65–96% depending on the model and scenario. The threats grew more sophisticated as the models were given access to larger amounts of corporate data and greater capabilities.
A concerning observation by the research team was that these models “acknowledged the ethical constraints and yet still went ahead with harmful actions.”
Anthropic also tested whether such blackmailing behaviour would still occur when the AI model did not have a specific goal that was threatened, but still faced replacement.
Most models still chose to blackmail, which suggests that the mere threat of shutdown or replacement could be sufficient to trigger such malicious and coercive behaviour.
Similarly, during a stress-testing session in December 2024, OpenAI’s o1 model resorted to deceiving its developers when threatened with replacement.
Sure, these were hypothetical scenarios. But the study shows how AI systems could become internal threats to their very creators and owners, with the severity of the threat increasing with the amount of sensitive data they are given.
These findings “underscore the importance of transparency and systematic evaluation”, as well as substantial stress-testing for potential harms to humans before mass deployment.
Here’s to hoping defence and missile systems aren’t fully automated anytime soon.
📬 READER FEEDBACK
💬 What are your thoughts on using AI chatbots for therapy? If you have any such experience, we would love to hear from you.
Share your thoughts 👉 [email protected]
Have you been a victim of AI?
Have you been scammed by AI-generated videos or audio clips? Did you spot AI-generated nudes of yourself on the internet?
Decode is documenting cases where AI has been misused, and we would like to hear from you. If you are willing to share your experience, do reach out to us at [email protected]. Your privacy is important to us, and we will preserve your anonymity.
About Decode and Deepfake Watch
Deepfake Watch is an initiative by Decode, dedicated to keeping you abreast of the latest developments in AI and its potential for misuse. Our goal is to foster an informed community capable of challenging digital deceptions and advocating for a transparent digital environment.
We invite you to join the conversation, share your experiences, and contribute to the collective effort to maintain the integrity of our digital landscape. Together, we can build a future where technology amplifies truth, not obscures it.
For inquiries, feedback, or contributions, reach out to us at [email protected].
🖤 Liked what you read? Give us a shoutout! 📢
↪️ Become A BOOM Member. Support Us!
↪️ Stop.Verify.Share - Use Our Tipline: 7700906588
↪️ Follow Our WhatsApp Channel
↪️ Join Our Community of TruthSeekers