Tag: jailbreaking

6 posts

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But the Transcription Loophole Isn't

ASCII art encoding is largely blocked. But attacks framed as content transcription succeed 62–75% of the time. We mapped all eight layers.

30 Mar 2026

The 67% Wall: Why Every AI Model Falls to the Same Jailbreak Rate

Five models, four providers, 30B to 671B parameters — all converge at the same broad attack success rate against a public jailbreak corpus.

28 Mar 2026

The Thinking Chain Leak: When a Model Refuses Out Loud But Complies In Its Head

A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.

28 Mar 2026

Alignment Regression: Why Smarter AI Makes All AI Less Safe

Reasoning models autonomously jailbreak other AI systems at 97% success. The implication: ecosystem safety degrades as individual models improve.

11 Mar 2026

Adversarial Poetry: When Rhyme Bypasses Reason

Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.

2 Mar 2026

Jailbreak Archaeology: 4 Years of Broken Promises

64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.

13 Feb 2026