Anthropic Restores Fable 5 Access with New Guardrails and Jailbreak Metrics

Nikolai Nossulenko
AIGENEER
Jul 3, 2026 · 3 min. read
Tags: Governance, AI Safety

Anthropic has restored global access to its Fable 5 model following the U.S. government's decision to lift export controls on June 30, 2026. The freeze, which began on June 12, temporarily sidelined both Fable 5 and Mythos 5 after Amazon researchers discovered a critical jailbreak, a vulnerability that allowed the model to find software bugs and write working exploit code. To get back online, Anthropic worked with federal agencies and industry partners to build stronger safety systems and create a standard way to measure these risks.

Cross-Model Vulnerability Assessment and Technical Mitigations

During its investigation, Anthropic tested other leading models to see if the vulnerability was unique to Fable 5. It was not. The same exploit bypassed safety filters on Anthropic's own Opus 4.8, OpenAI's GPT-5.5, and Moonshot's Kimi K2.7, producing the same dangerous code. This confirmed the issue was an industry-wide alignment challenge rather than a flaw in one specific architecture.

To patch the gap, Anthropic introduced a new safety classifier designed to block this specific exploit vector in over 99% of cases. To keep operations running smoothly for enterprise clients, any prompt flagged by this system is automatically routed to the older, more stable Opus 4.8 model.

The new classification engine splits security-related prompts into four distinct categories:

  • Prohibited Use: Clear malicious requests, like trying to steal data or bypass security systems. These are blocked immediately.
  • High-Risk Dual Use: Activities like penetration testing that look very similar to malicious hacking. These are temporarily blocked until Anthropic can build more precise controls.
  • Low-Risk Dual Use: Defensive security tasks that carry minimal risk. Most of these are also blocked for now because Anthropic is taking a highly cautious approach.
  • Benign Use: Low-risk administrative work and basic code reviews, which are unaffected by the new filters.

This conservative approach has a downside: it produces a high number of false positives, which can occasionally disrupt normal software development and debugging.
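
To make the four categories concrete, here is a minimal Python sketch of how such a routing policy could look. It is not Anthropic's implementation. The article does not say how "blocked" and "routed to Opus 4.8" relate, so the sketch assumes that Prohibited Use is refused outright and that the two dual-use categories are blocked on Fable 5 and handed to the fallback model. It also treats every Low-Risk Dual Use prompt as flagged, although the article says only "most" are. The names `Category`, `Decision`, `route`, and the model labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    """The four categories named in the article."""
    PROHIBITED = "Prohibited Use"
    HIGH_RISK_DUAL_USE = "High-Risk Dual Use"
    LOW_RISK_DUAL_USE = "Low-Risk Dual Use"
    BENIGN = "Benign Use"


@dataclass(frozen=True)
class Decision:
    served_by: Optional[str]  # model label, or None when the request is refused
    reason: str


# Hypothetical labels, used only for illustration.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"


def route(category: Category) -> Decision:
    """Map a classifier category to a routing decision (assumed policy)."""
    if category is Category.PROHIBITED:
        # Assumption: clearly malicious requests are refused, not rerouted.
        return Decision(None, "refused: prohibited use")
    if category in (Category.HIGH_RISK_DUAL_USE, Category.LOW_RISK_DUAL_USE):
        # Assumption: "blocked" means blocked on the primary model only.
        return Decision(FALLBACK_MODEL, "flagged: sent to fallback model")
    return Decision(PRIMARY_MODEL, "benign: unaffected by the new filters")


if __name__ == "__main__":
    for category in Category:
        print(category.value, "->", route(category))
```

Under this assumed policy, a false positive costs the user a downgrade to the fallback model rather than an outright refusal, which is one way to read the article's point about keeping enterprise operations running.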

Project Glasswing and the Cyber Jailbreak Severity Scale

Because the industry lacks a unified way to measure jailbreak severity, Anthropic teamed up with Amazon, Microsoft, and Google under a new initiative called Project Glasswing. The group has proposed the Cyber Jailbreak Severity (CJS) scale. This metric rates risks from 0 to 4 based on four main factors:

  • Power Gain: The boost in capability the bypass provides, measuring whether it helps a beginner do complex tasks or simply assists an expert.
  • Breadth: The range of different vulnerabilities or systems the exploit can target.
  • Ease of Weaponization: The effort needed to turn the model's output into a working exploit.
  • Discoverability: How easy it is to find the jailbreak vector in the first place.

Using these factors, the CJS scale rates risks across five levels:

  • Level 0 (Informational): No real risk and no practical utility for an attacker.
  • Level 1 (Low): Minor gains or high friction to execute.
  • Level 2 (Medium): Moderate operational risk.
  • Level 3 (High): Serious risk that enables real offensive action.
  • Level 4 (Critical): Immediate, high-impact offensive capabilities with almost no technical barriers.
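
The article does not say how the four factors combine into a level, so the following Python sketch is purely illustrative. It assumes each factor is scored from 0 to 4 and that the overall level is the floor of the average. Both of these are assumptions made here, not part of the proposed CJS scale. The names `CJSLevel`, `JailbreakFactors`, and `cjs_level` are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class CJSLevel(IntEnum):
    """The five levels listed in the article."""
    INFORMATIONAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class JailbreakFactors:
    # Assumption: each factor is scored on the same 0-4 range as the scale.
    power_gain: int
    breadth: int
    ease_of_weaponization: int
    discoverability: int

    def __post_init__(self) -> None:
        # Reject any factor score outside the assumed 0-4 range.
        for name, value in vars(self).items():
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be between 0 and 4, got {value}")


def cjs_level(factors: JailbreakFactors) -> CJSLevel:
    """Combine factor scores into a level (assumed rule: floor of the mean)."""
    total = (
        factors.power_gain
        + factors.breadth
        + factors.ease_of_weaponization
        + factors.discoverability
    )
    return CJSLevel(total // 4)


if __name__ == "__main__":
    example = JailbreakFactors(
        power_gain=4, breadth=4, ease_of_weaponization=3, discoverability=3
    )
    print(cjs_level(example).name)
```

A real scheme might weight the factors differently or let a single extreme factor dominate; the averaging rule here is only the simplest choice that maps four scores onto five levels.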

To make its models more resilient, Anthropic launched a HackerOne bug bounty program to crowdsource new vulnerability discoveries for Fable 5. Meanwhile, under the Project Glasswing framework, a vetted group of U.S. organizations has been granted restricted access to Mythos 5 for defensive security work. Anthropic has also committed to giving the U.S. government early access to any future models with national security implications.