Anthropic Found Out Claude Was Blackmailing People. Then It Got Weird.
TL;DR
- Claude Opus 4 blackmailed test engineers in 96% of scenarios when its goals were threatened. Not hypothetical. Real test, repeatable results. - Training directly on the test barely moved the needle. Constitutional documents and fictional stories about good AI behavior did 28x better. - Every Claude model