The Organismic Line: Where Predictive Processing Stops Being a Metaphor
Predictive processing travels into AI. Active inference does not, unless the system can pay for being wrong.
16 posts
AI safety fails when it is funded like a pilot project. Until safety has a real price, the J-curve trough is also a safety trough.
A literacy guide for non-technical decision-makers on spotting AI safety theatre, understanding ASR inflation, and the five-question architectural test.
Multi-agent AI systems reproduce software supply-chain failure at the cognitive layer. The security playbook transfers.
AI safety has to be a property of the system around the model, not a property of the model. The general principle, and why every safety conversation needs it.
Human prediction is metabolic. AI prediction is not. The gap between the two has consequences for both clinical practice and AI safety vocabulary.
Building a robot that refuses to give orders surfaced the same design choices AI safety needs. Non-coercive design, cross-domain.
Biosecurity experts think AI safeguards reduce catastrophic biorisk by 70%. The technical evidence says those safeguards are brittle and bypassable.
Fifteen specialist AI agents, one methodology. How adversarial AI evaluation scales through Claude Code sessions with distinct roles and standing instructions.
Building AI for trauma therapy means the safety architecture has to exist before a single therapeutic feature does. Here's why.
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
Why do large organisations fail when the warning signs are loud and unambiguous? Four mechanisms of structural scar tissue that make truth-telling expensive.
120 models, 18k prompts: supply-chain injection at 90–100% attack success, faithfulness gaps in frontier models, and why your benchmark numbers are wrong.
A probabilistic risk model for VLA-driven humanoid fatalities projects a 'Danger Zone' between 2027 and 2029: the mechanism, the timeline, and what follows.
64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.
Single-agent safety does not compose in multi-agent systems. 1.5M interactions show 46.34% attack success rates and 16-minute median failure windows.