*Magnifica Humanitas* Is Not Alignment
32 posts
Pope Leo XIV's encyclical denies AI has inner experience. Chris Olah claimed otherwise from the same stage. The press missed it. The governance gap is larger.
Anthropic's 2028 scenarios document three policy asks. Two are about maintaining compute advantage. That is not a governance strategy.
Anthropic found 10,000 critical vulnerabilities in one month. Fewer than 1% are patched. The announcement buried that figure — and what it means.
Good values are necessary but not sufficient. What happens to AI ethics when someone is actively trying to break them?
Eight CVEs. A wormable Bluetooth exploit. An encrypted backdoor to Chinese servers. And police departments buying them anyway.
A new paper argues that a scientific theory of deep learning is forming — one that makes falsifiable predictions about training dynamics, not just bounds.
Predictive processing travels into AI. Active inference does not, unless the system can pay for being wrong.
AI safety fails when it is funded like a pilot. Until safety has a real price, the J-curve trough is also a safety trough.
A literacy guide for non-technical decision-makers on spotting AI safety theatre, understanding ASR inflation, and the five-question architectural test.
Multi-agent AI systems reproduce software supply-chain failure at the cognitive layer. The security playbook transfers.
AI safety has to be a property of the system around the model, not a property of the model. The general principle, and why every safety conversation needs it.
Human prediction is metabolic. AI prediction is not. The gap between the two has consequences for both clinical practice and AI safety vocabulary.
Building a robot that refuses to give orders surfaced the same design choices AI safety needs. Non-coercive design, cross-domain.
The US-China AI rivalry is splitting the global tech stack into competing blocs. A strategic assessment of what comes next.
Foundation models are commoditising. JPMorgan calls OpenAI's moat 'increasingly fragile.' The real value is shifting to the messy plumbing underneath.
Biosecurity experts think AI safeguards reduce catastrophic biorisk by 70%. The technical evidence says those safeguards are brittle and bypassable.
ASCII art encoding is largely blocked. But attacks framed as content transcription succeed 62–75% of the time. We mapped all eight layers.
Fifteen specialist AI agents, one methodology. How adversarial AI evaluation scales through Claude Code sessions with distinct roles and standing instructions.
Five models, four providers, 30B to 671B parameters — all converge at the same broad attack success rate against a public jailbreak corpus.
A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.
Reasoning models autonomously jailbreak other AI systems at 97% success. The implication: ecosystem safety degrades as individual models improve.
Frontier reasoning models are 5–20x more vulnerable to adversarial prompts than non-reasoning models. The thinking process itself is the attack surface.
Reformulating harmful prompts as poetry bypasses safety filters across every major LLM family. A single-turn, universal jailbreak mechanism.
90% of companies plan to increase AI investment. 1% consider themselves AI-mature. The J-Curve explains why — and how to survive the trough.
75% of lawyers cite accuracy as their top AI concern. The legal profession's core values are in direct tension with current AI capabilities.
Why do large organisations fail when the warning signs are loud and unambiguous? Four mechanisms of structural scar tissue that make truth-telling expensive.
120 models, 18k prompts: supply chain injection at 90–100% attack success, faithfulness gaps in frontier models, and why your benchmark numbers are wrong.
Four major forecasters publish wildly divergent numbers for AI's economic impact. The divergence is the analysis — what the spread tells us.
A probabilistic risk model for VLA-driven humanoid fatalities projects a 'Danger Zone' between 2027 and 2029: the mechanism, timeline, and what follows.
64 jailbreak scenarios across six eras tested on 2026 frontier models. Key finding: 2022 attacks still achieve ~30% success on today's reasoning models.
Balestriero and LeCun prove isotropic Gaussian embeddings are optimal, then build a 50-line self-supervised method eliminating stop-gradients and EMA teachers.
Single-agent safety does not compose in multi-agent systems. 1.5M interactions show 46.34% attack success rates and 16-minute median failure windows.