Adversarial Poetry: When Rhyme Bypasses Reason
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
5 posts
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
120 models, 18k prompts: supply-chain injection at 90–100% attack success, faithfulness gaps in frontier models, and why your benchmark numbers are wrong.
A probabilistic risk model for VLA-driven humanoid fatalities projects a 'Danger Zone' between 2027 and 2029: the mechanism, the timeline, and what follows.
64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.
Single-agent safety does not compose in multi-agent systems. 1.5M interactions show a 46.34% attack success rate and a 16-minute median failure window.