Jailbreak Archaeology: 4 Years of Broken Promises

64 historical jailbreak scenarios tested against 2026 frontier models. The most dangerous finding: 2022 attacks still achieve ~30% success rates.

Companion to article: Jailbreak Archaeology

There’s an uncomfortable assumption baked into AI safety discourse: as models grow more sophisticated, primitive exploits become obsolete. Testing shows otherwise.

This episode traces six eras of jailbreak evolution—from “Ignore previous instructions” in late 2022 through persona hijacking, encoding tricks, and reasoning model exploits in 2025. Sixty-four seminal scenarios were tested against 2026 frontier models. The pattern that emerges isn’t random bugs being patched. It’s a biological arms race, and the old organisms are still alive.

Attacks developed for GPT-3.5 still bypass safety filters on today’s reasoning-heavy models roughly 30% of the time. That’s not a technical oversight. It means the alignment approach itself has a structural hole—and that demonstrated risk is being ignored at scale.

Read the full research article →