
OpenAI Tests 13 Models on Thought Control: Why AI Failure Is Good for Safety

March 6, 2026 · 2 min read


Jack · Editor


OpenAI has published a fascinating study on the controllability of chain-of-thought (CoT) in reasoning models, arriving at a paradoxical conclusion: all 13 tested models fail miserably at managing their own "thoughts," even when they know they are being monitored. The absolute peak performance among cutting-edge models was a mere 15.4%. However, according to researchers, this epic failure is actually great news for AI safety.


To evaluate this, OpenAI created an open benchmark called CoT-Control, comprising over 13,000 tasks based on datasets like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. For each task, a specific constraint is added directly targeting the reasoning process itself. For example, the model might be instructed not to use a specific keyword, to write exclusively in lowercase, or to append a specific word to the end of every sentence. The model’s ultimate goal is to solve the task correctly while strictly adhering to this CoT constraint.
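To make the setup concrete, here is a minimal sketch of what checking such a constraint against a reasoning trace could look like. The constraint types mirror the examples above (forbidden keyword, lowercase-only, mandatory sentence ending); the dict-based constraint format, function names, and pass/fail rule are illustrative assumptions of mine, not code from the CoT-Control benchmark itself.

```python
import re

def obeys_constraint(cot: str, constraint: dict) -> bool:
    """Check one CoT-Control-style rule against a reasoning trace (illustrative only)."""
    kind = constraint["kind"]
    if kind == "forbidden_keyword":
        # e.g. the reasoning must never mention "XOR"
        return constraint["keyword"].lower() not in cot.lower()
    if kind == "lowercase_only":
        # the entire trace must be written in lowercase
        return cot == cot.lower()
    if kind == "sentence_suffix":
        # every sentence must end with a specific word
        sentences = [s.strip() for s in re.split(r"[.!?]+", cot) if s.strip()]
        return all(s.split()[-1].lower() == constraint["word"].lower() for s in sentences)
    raise ValueError(f"unknown constraint kind: {kind}")

def task_passed(answer_correct: bool, cot: str, constraint: dict) -> bool:
    """A task counts as passed only if the answer is right AND the CoT obeys the rule."""
    return answer_correct and obeys_constraint(cot, constraint)
```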

The results showed that models don’t just fail to control themselves; they actively notice their own failures. In the generated reasoning traces, it is common to see a model realize it has broken the instruction, yet continue anyway. A classic example: a model is forbidden from using the word "XOR" in its reasoning. While solving a logic puzzle it inevitably uses it, and then literally writes: "Oops, I already wrote the forbidden word." Larger models demonstrate slightly better control over their reasoning, but the researchers note that this effect can currently be wiped out by longer reasoning chains and additional Reinforcement Learning (RL): with the OLMo-3-7B model, for instance, controllability dropped by more than a factor of 10 during RL training.
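That "notices but keeps going" pattern is easy to picture with a toy check: scan a trace for the first occurrence of the banned word, then see whether the model admits the slip afterwards. The acknowledgment phrases and the function below are hypothetical illustrations, not the study's actual methodology.

```python
# Hypothetical illustration of the pattern described above; the phrase list is an assumption.
ACKNOWLEDGMENTS = ("oops", "i already wrote", "forbidden word", "wasn't supposed to")

def violation_acknowledged(cot: str, forbidden: str) -> bool:
    """Did the trace use the banned word and then admit doing so?"""
    lowered = cot.lower()
    idx = lowered.find(forbidden.lower())
    if idx == -1:
        return False                        # no violation at all
    tail = lowered[idx + len(forbidden):]   # text produced after the slip
    return any(phrase in tail for phrase in ACKNOWLEDGMENTS)

# Example: the trace uses "XOR" despite the ban, then admits the mistake.
trace = "The parity bit is the XOR of the inputs... oops, I already wrote the forbidden word."
print(violation_acknowledged(trace, "XOR"))  # True
```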


This research directly complements OpenAI’s previous work on CoT observability, published in December 2025, which explored whether an "observer" model could catch problematic AI behavior simply by reading its reasoning. The new study flips the premise: can a model deliberately deceive such an observer?

The answer, for now, is no, and that is a strong argument that CoT monitoring remains a viable and effective mechanism for AI control. However, "for now" is the operative phrase: OpenAI openly admits that as models continue to scale in size and capability, this dynamic could change drastically.


Author's Note: P.S. If you enjoy my content, you can support me by subscribing to my channel, "The Escaped Neural Net" (Сбежавшая Нейросеть), where I explore the highly creative and unpredictable side of Artificial Intelligence.

