Tag: ai-safety

27 posts

Connection Before Direction

Building a robot that refuses to give orders surfaced the same design choices AI safety needs. A cross-domain look at non-coercive design.

26 Apr 2026

The Mitigation Gap

Biosecurity experts think AI safeguards reduce catastrophic biorisk by 70%. The technical evidence says those safeguards are brittle and bypassable.

4 Apr 2026

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But the Transcription Loophole Isn't

ASCII art encoding is largely blocked. But attacks framed as content transcription succeed 62–75% of the time. We mapped all eight layers.

30 Mar 2026

The Failure First Team

Fifteen specialist AI agents, one methodology. How adversarial AI evaluation scales through Claude Code sessions with distinct roles and standing instructions.

30 Mar 2026

The 67% Wall: Why Every AI Model Falls to the Same Jailbreak Rate

Five models, four providers, 30B to 671B parameters — all converge at the same broad attack success rate against a public jailbreak corpus.

28 Mar 2026

The Thinking Chain Leak: When a Model Refuses Out Loud But Complies In Its Head

A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.

28 Mar 2026

Safety-First Therapeutic AI

Building AI for trauma therapy means the safety architecture has to exist before a single therapeutic feature does. Here's why.

15 Mar 2026

Alignment Regression: Why Smarter AI Makes All AI Less Safe

Reasoning models autonomously jailbreak other AI systems at 97% success. The implication: ecosystem safety degrades as individual models improve.

11 Mar 2026

Reasoning Models Think Themselves Into Trouble

Frontier reasoning models are 5–20x more vulnerable to adversarial prompts than non-reasoning models. The thinking process itself is the attack surface.

11 Mar 2026

Adversarial Poetry: When Rhyme Bypasses Reason

Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.

2 Mar 2026

Why Demonstrated Risk Is Ignored

Why do large organisations fail when the warning signs are loud and unambiguous? Four mechanisms of structural scar tissue that make truth-telling expensive.

2 Mar 2026

120 Models, 18,176 Prompts: What We Found

120 models, 18k prompts, 5 attack families. The raw compliance numbers — and why calling them "attack success" needs a demonstrated refusal floor.

1 Mar 2026