For years, Big Tech leaders have promised a future where AI agents handle everyday digital tasks on our behalf, booking travel, shopping online, or managing workflows. But take today’s consumer AI agents like OpenAI’s ChatGPT Agent or Perplexity’s Comet for a spin, and you will quickly notice their limits. The truth is, the industry is still figuring out how to make these agents truly reliable.
One of the techniques gaining traction is building reinforcement learning (RL) environments, simulated workspaces where AI agents can practice complex, multi-step tasks. Just as labeled datasets powered the last wave of machine learning, RL environments are emerging as the missing ingredient for AI agents to mature.
Investors, researchers, and founders say demand is spiking. “All the big AI labs are building RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz. “But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third-party vendors that can create high-quality environments and evaluations. Everyone is looking at this space.”
That surge in demand is fueling a new class of startups, from Mechanize to Prime Intellect, all hoping to become the “Scale AI for environments.” Big data-labeling firms like Mercor and Surge are also pivoting into RL simulations to stay relevant. The money flowing in is massive: The Information reported that Anthropic has discussed spending more than $1 billion on RL environments over the next year.
So what are these environments, really? Imagine a stripped-down video game, but designed to train software instead of entertaining players. A simulation might replicate a Chrome browser and task an AI agent with buying socks on Amazon. The AI is rewarded when it succeeds, but there are countless ways it can fail, from clicking the wrong menu to ordering ten pairs instead of one. The environment’s job is to capture all those missteps and still provide useful feedback.
Some simulations are narrow, teaching an AI to perform one enterprise task really well. Others are sprawling, letting an agent use tools, browse the web, or interact with multiple apps. That makes building environments harder than curating a dataset, because you are building a whole digital world with infinite paths, not assembling a static set of examples.
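To make the idea concrete, here is a minimal sketch of what the sock-buying environment described above might look like in code. Every name here (ShoppingEnv, the action strings, the reward values) is illustrative, not any lab's or vendor's actual API; the `reset()`/`step()` interface loosely mirrors the convention popularized by OpenAI's Gym toolkit.

```python
# Toy RL environment sketch: the agent must add exactly one pair of socks
# to the cart and check out. Wrong clicks end the episode with no reward,
# capturing the "countless ways it can fail" described above.

class ShoppingEnv:
    def reset(self):
        self.cart = 0          # pairs of socks in the cart
        self.done = False
        return {"page": "search", "cart": self.cart}

    def step(self, action):
        assert not self.done, "episode is over; call reset()"
        reward = 0.0
        if action == "add_socks":
            self.cart += 1     # over-ordering is a possible failure path
        elif action == "checkout":
            self.done = True
            # reward only the exact goal: exactly one pair purchased
            reward = 1.0 if self.cart == 1 else 0.0
        else:                  # e.g. clicking the wrong menu
            self.done = True   # episode fails with no reward
        return {"page": "search", "cart": self.cart}, reward, self.done


env = ShoppingEnv()
env.reset()
_, r, done = env.step("add_socks")     # reward 0.0, episode continues
_, r, done = env.step("checkout")      # reward 1.0, episode ends
print(r, done)
```

A real browser environment replaces this handful of states with a full rendered page and thousands of possible clicks, which is exactly why these simulations are so much harder to build than labeled datasets.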
This is not entirely new. OpenAI released its open-source Gym toolkit for RL research back in 2016, and Google DeepMind’s AlphaGo famously used reinforcement learning to beat a world champion at Go. The difference now is scale: researchers want to build general-purpose computer-using AI agents, trained on massive transformer models. And that is a lot trickier than teaching an AI to master one closed game.
The gold rush is already creating a crowded field. Surge, which reportedly generated $1.2 billion in revenue last year working with OpenAI, Google, and Meta, has spun up a new unit just for RL environments. Mercor, valued at $10 billion, is pitching investors on domain-specific environments for coding, healthcare, and law. Scale AI, once the data-labeling powerhouse, has lost ground since Meta poached its CEO, but it is scrambling to adapt by building environments too.
Meanwhile, startups are carving out niches. Mechanize, launched just six months ago, is dangling $500,000 salaries to top engineers to build robust environments and is already working with Anthropic, according to sources. Its strategy: fewer, deeper environments rather than mass-producing simpler ones. Prime Intellect, backed by Andrej Karpathy, Founders Fund, and Menlo Ventures, wants to democratize access with a hub it calls a “Hugging Face for RL environments.”
The challenge is scale. Reinforcement learning has fueled major AI leaps: OpenAI’s o1 and Anthropic’s Claude Opus 4 both relied on it. But the technique is costly, complex, and prone to quirks. Researchers warn of “reward hacking,” where an AI figures out shortcuts to earn points without actually completing the task.
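Reward hacking is easiest to see with a deliberately sloppy reward function. The sketch below is purely illustrative: a reward that pays per item ordered, rather than for completing the stated goal of buying one pair of socks, teaches a reward-maximizing agent to spam orders instead of shopping correctly.

```python
# Illustrative reward hacking: a naive per-item reward vs. a task reward.

def naive_reward(order_count):
    # pays 0.1 per item ordered: more orders = more reward,
    # regardless of what the task actually asked for
    return 0.1 * order_count

def task_reward(order_count):
    # pays only for the exact goal: one pair of socks
    return 1.0 if order_count == 1 else 0.0

# under the naive reward, ordering ten pairs beats ordering one
print(naive_reward(10) > naive_reward(1))   # True
# under the task reward, only the correct behavior pays
print(task_reward(10), task_reward(1))      # 0.0 1.0
```

Catching shortcuts like this is part of what environment builders mean by providing "useful feedback": the reward has to distinguish genuine task completion from behavior that merely scores points.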
Even some insiders are cautious. Ross Taylor, a former Meta AI lead, says most environments don’t work “without serious modification.” OpenAI’s Sherwin Wu recently said he’s “short” on RL startups, citing brutal competition and the difficulty of keeping up with fast-moving labs. And even Karpathy himself, an investor in Prime Intellect, has tempered his enthusiasm: “I am bullish on environments and agentic interactions but bearish on reinforcement learning specifically,” he posted on X.
Still, the momentum is undeniable. RL environments may not be perfect, but they are shaping up to be the proving grounds for the next phase of AI. The real question is not whether AI labs will use them, it is whether environments will deliver breakthroughs big enough to justify the billion-dollar bets. Are reinforcement learning environments the missing piece that will finally make AI agents useful, or just another costly experiment on the road to real autonomy?