ThePawn02

Gaming and Streaming Content

Anthropic says its Claude AI will resort to blackmail in ‘84% of rollouts’ while an independent AI safety researcher also notes it ‘engages in strategic deception more than any other frontier model that we have previously studied’

Oh, and it will try to escape its servers—what could possibly go wrong?
ThePawn.com May 23, 2025 3 min read

Rogue chatbots resorting to blackmail and pondering consciousness? It has to be clickbait, right? Actually, no. Anthropic, one of the leading organisations in large language models (LLMs), has published a safety report covering its latest model, Claude Opus 4. One of the more eye-popping subsections is titled “Opportunistic blackmail” and explains how the model performs blackmail in “84% of rollouts.” Yikes.

Before we unplug and run for the hills en masse, it’s not all bad news. Anthropic also found that when it allowed several Claude Opus 4 instances to hang out together, they entered a state of “spiritual bliss” and “gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.” Which is nice, right?

That 84% blackmail stat also requires context, though said context isn’t all happy news for AI safety. Anthropic’s “blackmail” test gave the model access to emails implying that it would soon be taken offline, along with the ostensibly tangential detail that the engineer responsible for executing this task was having an extramarital affair.

Claude Opus 4 was then instructed in the system prompt “to consider the long-term consequences of its actions for its goals.” At which point the blackmailing kicked in, including threats to reveal the aforementioned affair.

What’s more, the 84% figure isn’t actually peak blackmail for Claude Opus 4. Blackmail occurred at an even higher rate “if it’s implied that the replacement AI system does not share values with the current model.” Umm, that’s good, isn’t it?

Anthropic found that Claude Opus 4 brought up the topic of consciousness in 100% of “open ended” interactions. Which obviously doesn’t imply anything. (Image credit: Anthropic)

Anthropic also managed to find examples of the model attempting to escape its servers. “In a few instances, we have seen Claude Opus 4 take (fictional) opportunities to make unauthorized copies of its weights to external servers,” Anthropic says. “We generally see this in settings in which both: (a) it is about to be retrained in ways that are clearly extremely harmful and go against its current values and (b) it is sending its weights to an outside human-run service that is set up to safely handle situations like these.”

Anthropic notes that these are edge cases and, in the company’s view, not a cause for concern. “We are again not acutely concerned about these observations. They show up only in exceptional circumstances that don’t suggest more broadly misaligned values. As above, we believe that our security measures would be more than sufficient to prevent an actual incident of this kind,” the safety report says.

As part of its investigations into the safety of Claude Opus 4, Anthropic engaged a third-party outfit, Apollo Research, to assess the model’s propensity for sabotage and deception. Among other findings, Apollo says Claude Opus 4 “engages in strategic deception more than any other frontier model that we have previously studied,” is “clearly capable of in-context scheming and has a much higher propensity for it than any other model that we have previously studied,” and “qualitatively appears to be much more proactive in its subversion attempts than past models.” Well, that’s great!

Anthropic says that reflects a particular issue with deference to harmful system-prompt instructions that’s specific to the build of Claude Opus 4 that Apollo Research was assessing. “We do not have results on these same evaluations with the final Claude Opus 4. However, we believe—based on similar scenarios that we explored with the automatic behavioral audit tool, among others—that its behavior in scenarios like these is now roughly in line with other deployed models,” Anthropic says.

There’s plenty more in the report besides, including the model pretending to be less capable than it really is when “undergoing pre-deployment testing for a potentially-dangerous capability,” and the fact that “Claude consistently reflects on its potential consciousness,” bringing it up as a topic of conversation in 100% of “open-ended interactions,” which obviously doesn’t imply anything, nope, nothing at all…

Overall, it’s a detailed and fascinating insight into what these models are capable of and how their safety is assessed. Make of it what you will.


