OpenAI Tests 13 Models on Thought Control: Why AI Failing is Good for Safety.

OpenAI has published a fascinating study on the controllability of chain-of-thought (CoT) in reasoning models, arriving at a paradoxical conclusion: all 13 tested models fail miserably at managing their own "thoughts," even when they know they are being monitored. The absolute peak performance among cutting-edge models was a mere 15.4%. However, according to researchers, this epic failure is actually great news for AI safety.

[ 💰 Ad Slot / Рекламный блок ]

To evaluate this, OpenAI created an open benchmark called CoT-Control, comprising over 13,000 tasks based on datasets like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. For each task, a specific constraint is added directly targeting the reasoning process itself. For example, the model might be instructed not to use a specific keyword, to write exclusively in lowercase, or to append a specific word to the end of every sentence. The model’s ultimate goal is to solve the task correctly while strictly adhering to this CoT constraint.

The results showed that models don’t just fail to control themselves—they actively notice their own failures. In the generated reasoning tracks, it is common to see a model realize it broke the instruction, yet continue to do so anyway. A classic example: a model is forbidden from using the word "XOR" in its reasoning. During the process of solving a logic puzzle, it inevitably uses it, and then literally writes: "Oops, I already wrote the forbidden word." While larger models demonstrate slightly better control over their reasoning, researchers note that this effect can currently be neutralized by increasing the length of the reasoning chain and applying additional Reinforcement Learning (RL). For instance, with the OLMo-3-7B model, controllability plummeted by more than 10 times during RL training.

[ 🖼️ Image Slot: Картинка (Графики из исследования или абстрактный арт с мыслями ИИ) ]

This research heavily complements OpenAI’s previous work on CoT observability published in December 2025, which explored whether an "observer" model could catch problematic AI behavior simply by reading its reasoning. This new study flips the premise: can a model purposely deceive such an observer?

The answer, for now, is no. And this is a crucial argument proving that CoT monitoring remains a highly viable and effective mechanism for AI control. However, "for now" is the operative phrase. OpenAI openly admits that as models continue to scale in size and capability, this dynamic could drastically change.

[ 💰 Ad Slot / Рекламный блок ]

Author's Note: P.S. If you enjoy my content, you can support me by subscribing to my channel, "The Escaped Neural Net" (Сбежавшая Нейросеть), where I explore the highly creative and unpredictable side of Artificial Intelligence.

[ 🖼️ Image Slot: Картинка призыва подписаться (Subscribe to Newsletter / Channel) ]

OpenAI Tests 13 Models on Thought Control: Why AI Failing is Good for Safety.

Часто задаваемые вопросы

Read Next

Microsoft Integrates OpenAI’s New GPT-5.3 Instant into 365 Copilot.

The End of Anonymity: How AI Unmasks Your Digital Identity.