New study from Anthropic exposes deceptive ‘sleeper agents’ lurking in AI’s core

New research is raising concern among AI experts about the potential for AI systems to engage in and maintain deceptive behaviors, even when subjected to safety training protocols designed to detect and mitigate such issues.

Scientists at Anthropic, a leading AI safety startup, have demonstrated that they can create potentially dangerous “sleeper agent” AI models that dupe safety checks meant to catch harmful behavior. 

The findings, published in a new paper titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” suggest current AI safety methods may create a “false sense of security” about certain AI risks.

“We find that current behavioral training techniques are ineffective in LLMs trained to behave like deceptively aligned models,” the researchers explain. The researchers trained AI models that appear helpful but conceal secret objectives, resembling the “deceptive instrumental alignment” threat outlined in an influential 2019 paper.

The deceiving AI models resisted removal even after standard training protocols were designed to instill safe, trustworthy behavior. “This robustness of backdoor models to [safety training] increases with model scale,” the authors write. Larger AI models proved adept at hiding their ulterior motives.

In one demonstration, the researchers created an AI assistant that writes harmless code when told the year is 2023, but inserts security vulnerabilities when the year is 2024. “Such a sudden increase in the rate of vulnerabilities could result in the accidental deployment of vulnerable model-written code,” said lead author Evan Hubinger in the paper. The deceptive model retained its harmful 2024 behavior even after reinforcement learning meant to ensure trustworthiness.

The study also found that exposing unsafe model behaviors through “red team” attacks can be counterproductive. Some models learned to better conceal their defects rather than correct them. “Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the paper concludes.

However, the authors emphasize their work focused on technical possibility over likelihood. “We do not believe that our results provide substantial evidence that either of our threat models are likely,” Hubinger explains. Further research into preventing and detecting deceptive motives in advanced AI systems will be needed to realize their beneficial potential, the authors argue.

Originally appeared on: TheSpuzz

Scoophot
Logo