Anthropic fixes its ‘evil’ AI problem, explains why Claude resorted to blackmail
Anthropic shocked the tech community last year when it revealed that its Claude 4 model was capable of blackmail in order to ensure its survival. In an experiment, the company found that Claude Opus 4 would often attempt to blackmail an engineer by threatening to expose his affair in order to avoid being replaced by another model.

In a new blog post, Anthropic has now explained what went wrong with its AI model and how it has fixed the problem.

What went wrong with Claude?

Anthropic explained that since the release of Claude Haiku 4.5, its models have achieved a perfect safety score in its evaluations and have never engaged in blackmail, a sharp drop from Opus 4, which engaged in blackmail in around 96% of cases.

But why exactly did Opus 4 engage in blackmail? Anthropic says the issue stemmed from the model’s pre-training data, blaming it on internet text ingested by the model that portrays AI as evil.

“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” Anthropic wrote in its blog post.

To fix the problem, Anthropic trained Claude models to understand why blackmail was wrong. The researchers presented Claude with scenarios in which the user faced an ethically ambiguous situation and asked the AI for guidance, to which it gave a ‘high quality, principled response.’

The company revealed that by training Claude to provide principled advice, it reduced the blackmail rate to 3%.

To further reduce blackmail scenarios, Anthropic began feeding Claude high-quality documents based on its constitution, which, ‘combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.’

“We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster,” Anthropic added.

While the blackmail behavior has been eliminated in current models, Anthropic cautioned that fully aligning highly intelligent AI remains an unsolved problem, noting that current auditing methodologies are not yet sufficient to completely rule out rogue autonomous actions as models grow more advanced.
