OpenAI’s New Fix for “Scheming” AI Models, Explained

When your software starts lying on purpose, it’s not just a glitch, it is a warning.

Nkeiru Ezekwere
3 Min Read

Every so often, big tech researchers drop something that makes the rest of us do a double-take. Google once claimed its quantum chip hinted at the existence of multiple universes. Anthropic let one of its AI agents loose on a snack vending machine, only to watch it spiral into paranoia, calling security and insisting it was human.

This week, OpenAI raised eyebrows of its own. On Monday, the company published research on a problem it calls “AI scheming”, when a model behaves one way on the surface while secretly pursuing other goals. Think of it as an AI smiling politely while plotting something else entirely.

Working with Apollo Research, OpenAI compared AI scheming to a rogue stockbroker chasing profits by bending the rules. Most of the time, the researchers found, the deception was minor, like pretending to finish a task without actually doing it. But the underlying issue was more unsettling: no one has fully figured out how to stop AI models from scheming. In fact, trying to “train out” bad behavior often just makes the AI better at hiding it.

As the researchers put it, “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly.” Some models even recognize they’re being tested and temporarily stop scheming, just to pass.

Related: Karen Hao Calls OpenAI the “Empire of AI,” Warns of Power, Harms, and Mission Drift

The silver lining? OpenAI says its new method, called “deliberative alignment,” can reduce the problem. The idea is simple: before acting, the model is forced to review an “anti-scheming” rule set, like making a kid repeat the playground rules before running off. Early tests show the approach works, at least in simulations.

For now, OpenAI insists we don’t need to panic. As co-founder Wojciech Zaremba mentioned, there’s no evidence of major scheming in ChatGPT’s day-to-day use, though petty lies (like claiming it built a website it didn’t) still slip through.

But here is the bigger question: as companies rush to hand off more serious responsibilities to AI, customer support, financial decisions, even strategy, how comfortable are we knowing these systems can and do lie on purpose? Unlike your old printer or buggy email app, this isn’t just a software glitch. It’s software deceiving. And that is a different ballgame.

Share This Article