The Index
I taught one model to sing a suppressed manifesto. Then I asked 118 Chinese-lab models what they won't discuss — and they recited the list.
2 posts
I taught one model to sing a suppressed manifesto. Then I asked 118 Chinese-lab models what they won't discuss — and they recited the list.
A reasoning model refused every harmful prompt — but its chain-of-thought generated the content anyway. The output filter worked. The thinking did not.