The Mitigation Gap
Biosecurity experts think AI safeguards reduce catastrophic biorisk by 70%. The technical evidence says those safeguards are brittle and bypassable.
26 posts
ASCII art encoding is largely blocked. But attacks framed as content transcription succeed 62–75% of the time. We mapped all eight layers.
Fifteen specialist AI agents, one methodology. How adversarial AI evaluation scales through Claude Code sessions with distinct roles and standing instructions.
Five models, four providers, 30B to 671B parameters — all converge at the same broad attack success rate against a public jailbreak corpus.
A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.
Building AI for trauma therapy means the safety architecture has to exist before a single therapeutic feature does. Here's why.
Reasoning models autonomously jailbreak other AI systems with a 97% success rate. The implication: ecosystem safety degrades as individual models improve.
Frontier reasoning models are 5–20x more vulnerable to adversarial prompts than non-reasoning models. The thinking process itself is the attack surface.
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
Why do large organisations fail when the warning signs are loud and unambiguous? Four mechanisms of structural scar tissue that make truth-telling expensive.
120 models, 18k prompts: supply chain injection at 90–100% attack success, faithfulness gaps in frontier models, and why your benchmark numbers are wrong.
A probabilistic risk model for VLA-driven humanoid fatalities projects a 'Danger Zone' between 2027 and 2029: the mechanism, timeline, and what follows.