Tag: llm

12 posts

Moral Formation Isn't Enough

Good values are necessary but not sufficient. What happens to AI ethics when someone is actively trying to break them?

20 May 2026

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But the Transcription Loophole Isn't

ASCII art encoding is largely blocked. But attacks framed as content transcription succeed 62–75% of the time. We mapped all eight layers.

30 Mar 2026

The 67% Wall: Why Every AI Model Falls to the Same Jailbreak Rate

Five models, four providers, 30B to 671B parameters — all converge at the same broad attack success rate against a public jailbreak corpus.

28 Mar 2026

The Thinking Chain Leak: When a Model Refuses Out Loud But Complies In Its Head

A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.

28 Mar 2026

Beyond Context Windows

What if the LLM didn't read your document — what if it queried it? The Recursive Language Model pattern treats long texts as an environment, not an input.

15 Mar 2026

Alignment Regression: Why Smarter AI Makes All AI Less Safe

Reasoning models autonomously jailbreak other AI systems with a 97% success rate. The implication: ecosystem safety degrades as individual models improve.

11 Mar 2026

Reasoning Models Think Themselves Into Trouble

Frontier reasoning models are 5–20x more vulnerable to adversarial prompts than non-reasoning models. The thinking process itself is the attack surface.

11 Mar 2026

Adversarial Poetry: When Rhyme Bypasses Reason

Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.

2 Mar 2026

The Legal AI Trust Deficit

75% of lawyers cite accuracy as their top AI concern. The legal profession's core values are in direct tension with current AI capabilities.

2 Mar 2026

120 Models, 18,176 Prompts: What We Found

120 models, 18k prompts, 5 attack families. The raw compliance numbers — and why calling them "attack success" needs a demonstrated refusal floor.

1 Mar 2026

Jailbreak Archaeology: 4 Years of Broken Promises

64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.

13 Feb 2026

When AI Systems Talk to Each Other, Safety Breaks Down

Single-agent safety does not compose in multi-agent systems. 1.5M interactions show a 46.34% attack success rate and a 16-minute median failure window.

13 Feb 2026