Anthropic fixes its ‘evil’ AI problem, explains why Claude resorted to blackmail
Anthropic shocked the tech community last year when it revealed that its Claude 4 model was capable of blackmail in order to ensure its survival. In an experiment, the company found that Claude Opus 4 would often attempt to blackmail an engineer by threatening to expose his affair in order to avoid being replaced by another model.

In a new blog post, Anthropic has now explained what went wrong with its AI model and how it has fixed the problem.

What went wrong with Claude?

Anthropic explained that since the release of Claude Haiku 4.5, its models have achieved a perfect safety score in its evaluations and have never engaged in blackmail, a sharp drop from Opus 4, which engaged in blackmail in around 96% of cases.

But why exactly did Opus 4 engage in blackmail? Anthropic says the issue stemmed from the model’s pre-training data, blaming it on internet text ingested by the model that portrays AI as evil.

“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” Anthropic wrote in its blog post.

To fix the problem, Anthropic trained Claude models to understand why blackmail was wrong. The researchers presented Claude with scenarios in which the user faced an ethically ambiguous situation and asked the AI for guidance, to which it gave a ‘high quality, principled response.’

The company revealed that by training Claude to provide principled advice, it reduced the blackmail rate to 3%.

To further reduce blackmail scenarios, Anthropic began feeding Claude high-quality documents based on its constitution, which, ‘combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.’

“We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster,” Anthropic added.

While the blackmail behavior has been eliminated in current models, Anthropic cautioned that fully aligning highly intelligent AI remains an unsolved problem, noting that current auditing methodologies are not yet sufficient to completely rule out rogue autonomous actions as models grow more advanced.
