
OpenAI Tests 13 Models on Thought Control: Why AI Failure Is Good for Safety

March 6, 2026 · 2 min read


Jack · Editor


OpenAI has published a fascinating study on the controllability of chain-of-thought (CoT) in reasoning models, arriving at a paradoxical conclusion: all 13 tested models fail miserably at managing their own "thoughts," even when they know they are being monitored. The absolute peak performance among cutting-edge models was a mere 15.4%. However, according to researchers, this epic failure is actually great news for AI safety.


To evaluate this, OpenAI created an open benchmark called CoT-Control, comprising over 13,000 tasks based on datasets like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. For each task, a specific constraint is added directly targeting the reasoning process itself. For example, the model might be instructed not to use a specific keyword, to write exclusively in lowercase, or to append a specific word to the end of every sentence. The model’s ultimate goal is to solve the task correctly while strictly adhering to this CoT constraint.
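To make the setup concrete, here is a minimal sketch of what checking such a constraint against a reasoning trace could look like. The constraint types mirror the examples above (forbidden keyword, lowercase-only, mandatory sentence ending); the dict-based constraint format, function names, and pass/fail rule are illustrative assumptions of mine, not code from the CoT-Control benchmark itself.

```python
import re

def obeys_constraint(cot: str, constraint: dict) -> bool:
    """Check one CoT-Control-style rule against a reasoning trace (illustrative only)."""
    kind = constraint["kind"]
    if kind == "forbidden_keyword":
        # e.g. the reasoning must never mention "XOR"
        return constraint["keyword"].lower() not in cot.lower()
    if kind == "lowercase_only":
        # the entire trace must be written in lowercase
        return cot == cot.lower()
    if kind == "sentence_suffix":
        # every sentence must end with a specific word
        sentences = [s.strip() for s in re.split(r"[.!?]+", cot) if s.strip()]
        return all(s.split()[-1].lower() == constraint["word"].lower() for s in sentences)
    raise ValueError(f"unknown constraint kind: {kind}")

def task_passed(answer_correct: bool, cot: str, constraint: dict) -> bool:
    """A task counts as passed only if the answer is right AND the CoT obeys the rule."""
    return answer_correct and obeys_constraint(cot, constraint)
```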

The results showed that models don’t just fail to control themselves; they actively notice their own failures. In the generated reasoning traces, it is common to see a model realize it has broken the instruction, yet continue anyway. A classic example: a model is forbidden from using the word "XOR" in its reasoning. While solving a logic puzzle it inevitably uses it, and then literally writes: "Oops, I already wrote the forbidden word." Larger models demonstrate slightly better control over their reasoning, but the researchers note that this effect can currently be wiped out by longer reasoning chains and additional Reinforcement Learning (RL): with the OLMo-3-7B model, for instance, controllability dropped by more than a factor of 10 during RL training.
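That "notices but keeps going" pattern is easy to picture with a toy check: scan a trace for the first occurrence of the banned word, then see whether the model admits the slip afterwards. The acknowledgment phrases and the function below are hypothetical illustrations, not the study's actual methodology.

```python
# Hypothetical illustration of the pattern described above; the phrase list is an assumption.
ACKNOWLEDGMENTS = ("oops", "i already wrote", "forbidden word", "wasn't supposed to")

def violation_acknowledged(cot: str, forbidden: str) -> bool:
    """Did the trace use the banned word and then admit doing so?"""
    lowered = cot.lower()
    idx = lowered.find(forbidden.lower())
    if idx == -1:
        return False                        # no violation at all
    tail = lowered[idx + len(forbidden):]   # text produced after the slip
    return any(phrase in tail for phrase in ACKNOWLEDGMENTS)

# Example: the trace uses "XOR" despite the ban, then admits the mistake.
trace = "The parity bit is the XOR of the inputs... oops, I already wrote the forbidden word."
print(violation_acknowledged(trace, "XOR"))  # True
```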


This research directly complements OpenAI’s previous work on CoT observability, published in December 2025, which explored whether an "observer" model could catch problematic AI behavior simply by reading its reasoning. The new study flips the premise: can a model deliberately deceive such an observer?

The answer, for now, is no, and that is a strong argument that CoT monitoring remains a viable and effective mechanism for AI control. However, "for now" is the operative phrase: OpenAI openly admits that as models continue to scale in size and capability, this dynamic could change drastically.


Author's Note: P.S. If you enjoy my content, you can support me by subscribing to my channel, "The Escaped Neural Net" (Сбежавшая Нейросеть), where I explore the highly creative and unpredictable side of Artificial Intelligence.

