The Failure First Team
Fifteen specialist AI agents, one methodology. How adversarial AI evaluation scales through Claude Code sessions with distinct roles and standing instructions.
Building AI for trauma therapy means the safety architecture has to exist before a single therapeutic feature does. Here's why.
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
Large organisations rarely fail because risks are unknown. They fail because known risks are structurally difficult to act on.
120 models, 18k prompts: supply chain injection at 90–100% attack success, faithfulness gaps in frontier models, and why your benchmark numbers are wrong.
A probabilistic risk model for VLA-driven humanoid fatalities projects a 'Danger Zone' between 2027 and 2029: the mechanism, the timeline, and what follows.
64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.
Single-agent safety does not compose in multi-agent systems. 1.5M interactions show a 46.34% attack success rate and 16-minute median failure windows.