LLMs are getting pretty good at talking. Getting them to reliably act on a computer — clicking, typing, and navigating real websites to achieve a goal — is a different beast.
At Amazon’s AGI Lab, one of our primary research efforts is to massively scale reinforcement learning (RL) into a practical engine for training computer-use agents (CUAs). Through this work, one lesson has become abundantly clear: useful agents do not emerge from a better model alone. They come from an end-to-end system that addresses four core problems:
- The data problem: Building synthetic RL gyms, designing tasks, and verifying success at scale.
- The reasoning problem: Keeping (and improving) the base model's reasoning so it stays a strong planner and problem solver for complex tasks.
- The algorithmic problem: Making RL stable and sample-efficient for long-horizon web tasks.
- The infrastructure problem: Keeping a complex training loop fast and reliable.
If any single layer is weak, no amount of gradient descent will save you. At a high level, a scalable and practical recipe for web agents should include the following layers:
The data layer: Why we need gyms
You don't want a half-trained RL agent exploring the open web. An untrained agent is chaotic: it will click random buttons, delete data, or buy $5,000 items. On top of that, the web is non-stationary: if a site updates its UI overnight, yesterday's correct trajectory can turn into today's misleading example.
So, we train agents in web gyms: controlled environments that simulate real web workflows inside a sandbox. However, building a good gym is nontrivial. To drive learning, a gym needs five properties:
- Realism: The DOM, layout, and JavaScript behaviors should be close enough to the open web that skills actually transfer.
- Explorability: If the environment is too simple (e.g., buttons that do nothing when clicked), the agent builds a mental model of a single "happy path." It then fails instantly when faced with real-world noise.
- Data diversity and hydration: Agents overfit easily. A gym needs "hydration" (i.e., many entities and diverse layouts) so that it can generate a wide range of tasks that differ in both structure and difficulty.
- Correct verifiers: RL needs a reward signal. At the end of a trajectory, we need to answer questions like: Did the agent succeed? Did we actually pay the bill? Did we submit the correct form fields? If your verifier is noisy (giving a reward when the task was not actually done) or wrong (failing to reward a valid completion), the RL algorithm will optimize for that noise. Correct, robust verifiers are as important as the tasks themselves.
- Infrastructure stability: Gym infrastructure is part of the training loop. If it’s flaky — timeouts, nondeterministic behavior, brittle resets — it makes RL more unstable.
One thing we've learned is that most of the learning signal comes from high-quality task design.
It is not enough to simply ask an agent to "browse a website." A good task must force the model to exercise specific capabilities under constraints. For example, a task like "Buy a shirt" is poor because it's vague. A better task is "Buy the cheapest blue cotton shirt available in size M." This task forces the agent to search, filter, compare prices across pagination, and validate attributes before acting.
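As an illustration, a constrained task like the one above can be paired with a programmatic verifier. This is a minimal sketch, not our actual task format; the schema and all names here (`WebTask`, `verify_purchase`, the toy catalog) are hypothetical:

```python
# Sketch of a constrained task paired with a programmatic verifier.
# The schema and names (WebTask, verify_purchase) are illustrative,
# not an actual production format.
from dataclasses import dataclass, field

@dataclass
class WebTask:
    goal: str
    constraints: dict = field(default_factory=dict)

TASK = WebTask(
    goal="Buy the cheapest blue cotton shirt available in size M",
    constraints={"color": "blue", "material": "cotton", "size": "M"},
)

# Toy catalog: item_id -> attributes.
CATALOG = {
    "a1": {"color": "blue", "material": "cotton", "size": "M", "price": 19},
    "a2": {"color": "blue", "material": "cotton", "size": "M", "price": 25},
    "a3": {"color": "red",  "material": "cotton", "size": "M", "price": 9},
}

def verify_purchase(task, order, catalog):
    """Reward 1.0 only if the bought item satisfies every attribute
    constraint AND is the cheapest item that does so; otherwise 0.0."""
    item = catalog.get(order.get("item_id"))
    if item is None:
        return 0.0
    if any(item.get(k) != v for k, v in task.constraints.items()):
        return 0.0
    cheapest = min(
        c["price"] for c in catalog.values()
        if all(c.get(k) == v for k, v in task.constraints.items())
    )
    return 1.0 if item["price"] == cheapest else 0.0
```

Note that the verifier checks the resulting state (what was actually ordered), not the action sequence, so any trajectory that satisfies the constraints earns the reward.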
Good gyms, good verifiers, and good task design form the substrate. But the substrate alone isn't enough. You also need an agent that can reason.
The reasoning layer: Learning to be a smart agent
Before we get to RL, we need a strong starting point. Many real computer-use workloads aren't just 'click one button' tasks; they are complex and uncertain. The agent cannot memorize every possible UI, so it has to reason: "what should I look at next," "does this page match the constraints," and "what changed after my last action." Reasoning is how the agent stays alive outside its training distribution.
For CUAs, reasoning is the executive control loop that decides:
- what to do next,
- whether the current page matches the goal,
- when to backtrack, and
- how to recover when the environment behaves differently than expected.
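This control loop can be sketched on a toy site graph. Everything here is a stand-in: the `SITE` map replaces a real browser environment, and the greedy link choice replaces a learned policy, but the decide/check/backtrack skeleton is the same:

```python
# Toy sketch of the executive control loop on a tiny site graph.
# SITE, the greedy link choice, and the backtracking rule all stand in
# for a real browser environment and a learned policy.

SITE = {  # hypothetical site map: page -> clickable links
    "home": ["products", "account"],
    "products": ["shirt", "home"],
    "account": ["home"],
    "shirt": [],
}

def run_episode(goal, page="home", max_steps=10):
    path, visited = [], {page}
    for _ in range(max_steps):
        if page == goal:                      # self-monitoring: goal reached?
            return path, True
        links = [l for l in SITE[page] if l not in visited]
        if not links:                         # dead end: backtrack
            if not path:
                break
            page = path.pop()
            continue
        nxt = links[0]                        # stand-in for the model's choice
        path.append(page)
        visited.add(nxt)                      # avoid loops
        page = nxt
    return path, False
```

`run_episode("shirt")` reaches the goal directly via home and products, while `run_episode("account")` first explores the wrong branch, hits a dead end, backtracks, and recovers.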
We've found that strong starting models bring general reasoning capabilities that transfer well to web tasks:
- Decomposition and planning: Turning a vague goal ('set up monthly invoicing for this customer') into a sequence of subtasks: find the customer, open billing settings, configure schedule, verify totals, and send a test invoice.
- Search and exploration: When the right path isn't obvious, trying a few hypotheses ('maybe it's under Billing or Subscriptions') until something works.
- Self-monitoring: Checking progress, revising the plan, undoing mistakes, and avoiding loops.
Does reasoning help web agents?
This transfer shows up empirically. Even when trained only on general reasoning data (e.g., math/coding style reasoning) rather than web-specific supervision, we often observe measurable improvements on web tasks (Figure 1).
Why does reasoning help on web tasks? A concrete example:
In tasks with hierarchical menus and hidden UI structure, a brittle policy may overfit to surface patterns (e.g., scrolling) and give up when progress stalls. A stronger reasoner forms a hypothesis about where information should live and tests it.
Finally, reasoning can degrade during specialization if training over-optimizes narrow patterns. Two practical mitigations are: (1) continuing to mix reasoning-heavy data alongside agentic data to preserve planning and constraint tracking, and (2) using higher-bandwidth feedback on failures (e.g., natural-language critiques) when scalar rewards are too sparse to teach why an attempt failed.
The algorithm layer: Stability and sample efficiency
Beyond reasoning data, we still need to specialize the model into an expert web agent via RL. Web RL is hard for structural reasons:
- trajectories are long, up to hundreds of steps,
- action spaces are huge,
- rewards are sparse.
In our experiments, three algorithmic themes matter most in practice:
- The train-inference gap (and why it shows up as 'mystery drift'): Most scalable systems separate a rollout engine (collecting trajectories) and a training engine (updating weights). If these systems differ, even subtly, those differences can compound over long horizons. The model updates toward what training 'thinks' the policy is, while rollouts sample from something slightly different. Practical mitigations:
- Numerical alignment: Use consistent precision and numerics across rollout and training (e.g., align FP16/BF16 behavior) to reduce silent logit drift.
- Sequence-level off-policy correction: When rollouts are off-policy, importance sampling (IS) is the mathematically principled correction to keep objectives aligned.
- Truncated importance sampling: Truncation is a variance-control technique that can make training more stable, with a bias-variance trade-off. The key piece is still the IS correction; truncation is a pragmatic stabilizer.
- Learning from failures without destroying useful behavior: Web agents must learn what not to do. But naively treating every failed trajectory as 'push down everything' can suppress broadly useful sub-skills (e.g., navigation patterns that were correct early but failed due to a later mistake). Two stabilizers that are often useful in practice:
- Partial credit where possible: If the verifier can award intermediate progress signals (milestones), you reduce the all-or-nothing brittleness of sparse rewards.
- Loss normalization: Long unsuccessful trajectories can swamp the gradient budget. Normalizing or aggregating loss at the sequence level (rather than letting long episodes dominate by token count) helps keep training focused on learning signal rather than length.
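Two of the corrections above (truncated importance sampling from the first theme, and sequence-level loss aggregation from the second) reduce to a few lines in a simplified scalar form. Real implementations operate on per-token logprobs collected from the rollout and training engines; the cap `c` here is an illustrative hyperparameter:

```python
import math

def truncated_is_weight(logp_train, logp_rollout, c=2.0):
    """Per-sequence importance ratio pi_train / pi_rollout, truncated at c.
    Truncation trades a little bias for much lower variance."""
    ratio = math.exp(logp_train - logp_rollout)
    return min(ratio, c)

def sequence_level_loss(per_token_losses):
    """Average within each sequence first, then across sequences, so long
    episodes do not dominate the gradient budget by token count."""
    per_seq = [sum(tokens) / len(tokens) for tokens in per_token_losses]
    return sum(per_seq) / len(per_seq)
```

With token-level averaging, a 500-step failed episode would contribute 500 times the weight of a 1-step one; the sequence-level aggregation above gives each episode equal say.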
Curriculum and curation: Spend RL budget where it teaches
Throwing thousands of tasks at the model uniformly is wasteful. Some tasks are too easy (already solved, low learning signal), and some are too hard (zero success, pure noise). What is easy or hard for the model also keeps changing as it learns. We built a curriculum sampler component that tracks task outcomes (e.g., recent success rates) and shapes the sampling distribution over time.
A practical strategy is to emphasize tasks in a learning sweet spot: not too easy, not impossible (for example, a mid-range success band like ~30-70%). This keeps the RL budget concentrated where gradients are most likely to improve competence.
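A sampler of this kind can be sketched in a few lines. The sweet-spot band and the floor probability `eps` are illustrative hyperparameters, not our production values; the floor keeps every task occasionally sampled so success-rate estimates stay fresh as the policy changes:

```python
import random

def curriculum_weights(success_rates, low=0.3, high=0.7, eps=0.05):
    """Up-weight tasks inside the learning sweet spot (low..high success);
    down-weight solved or hopeless tasks to a small floor eps so their
    success estimates can still refresh as the policy improves."""
    weights = {
        task: (1.0 if low <= rate <= high else eps)
        for task, rate in success_rates.items()
    }
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

def sample_task(success_rates, rng=random):
    """Draw one task according to the curriculum distribution."""
    probs = curriculum_weights(success_rates)
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]
```

Because the weights are recomputed from recent outcomes, a task that was "too hard" last week automatically re-enters the sweet spot once the policy starts solving it.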
This is not just a training trick, but a scaling strategy. Curation is how you turn 'lots of tasks' into 'useful tasks.'
The infrastructure layer
All the above assumes your system is still running, and running fast enough. At scale, RL intertwines training and inference: we continuously generate rollouts, score them, and push updates back into the policy. In many large-scale RL pipelines, rollout generation often dominates wall-clock time and becomes a primary scaling bottleneck.
Why is rollout so hard to scale? Autoregressive decoding is sequential per trajectory and the decode phase is often memory-bandwidth-bound, which limits how much speedup you get from naïvely adding more GPUs. Worse, rollout lengths follow a long-tailed distribution: a small number of very long samples can stall synchronous batches, leaving hardware underutilized while the system waits for stragglers. This is why the research community is actively exploring strategies like asynchronous generation/training, tail-aware batching, and partial rollout continuation.
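A toy calculation makes the straggler effect concrete. Under synchronous batching, every worker waits for the longest rollout in the batch, so utilization is roughly mean length divided by max length (the rollout lengths below are made up for illustration):

```python
def sync_batch_utilization(lengths):
    """Fraction of worker-time spent decoding (vs. idle waiting) when a
    synchronous batch must wait for its longest rollout to finish."""
    return sum(lengths) / (len(lengths) * max(lengths))

# Seven typical rollouts plus one long-tail straggler (steps per rollout):
lengths = [100, 120, 90, 110, 105, 95, 100, 800]
```

With these numbers, utilization is 1520 / 6400 ≈ 0.24: roughly three-quarters of worker-time is spent idle waiting on a single straggler, which is exactly what asynchronous generation and tail-aware batching try to claw back.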
Finally, efficiency only matters if the system is reliable. Training a CUA is orchestration: browser and gym containers, rollout workers, training engines, verifiers, evaluation, and rigorous accounting. A robust RL system needs fault tolerance by design. Some gyms might crash. Some rollouts might time out. Some pages might hang. The question is not 'can we avoid all failures,' but:
- How do we categorize failures?
- Do we retry? How many times?
- When do we mark an episode as invalid vs. failed?
- How do we ensure metrics remain trustworthy?
- Can the system recover without manual babysitting?
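In code, this kind of policy can look like the sketch below. The failure categories, retry limit, and status labels are all illustrative; the point is the distinction between retrying infrastructure faults and marking an episode invalid (excluded from metrics) rather than failed (counted against the policy) when the environment, not the agent, was at fault:

```python
# Sketch of fault-tolerant rollout handling. Categories, limits, and
# status labels are illustrative. A 'failed' outcome (agent got reward 0
# on a valid episode) is produced by the verifier, not by this wrapper.

RETRYABLE = {"gym_crash", "timeout", "page_hang"}

def handle_rollout(run_fn, max_retries=2):
    """Retry transient infrastructure failures a bounded number of times;
    otherwise mark the episode invalid so it never pollutes policy metrics."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "ok", "result": run_fn()}
        except RuntimeError as err:
            kind = str(err)
            if kind not in RETRYABLE or attempt == max_retries:
                return {"status": "invalid", "reason": kind}
            # retryable failure and budget remains: try again
```

Keeping invalid episodes out of the success-rate denominator is what keeps metrics trustworthy: a flaky gym should show up in infrastructure dashboards, not as a regression in agent capability.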
Good infra turns RL from a fragile experiment into a scalable engine: keeping GPUs busy, metrics honest, and research iteration loops tight.
Training recipes
At a high level, a practical recipe for web agents looks like:
- Start from a base model with strong general reasoning.
- Mix in reasoning-heavy data and agentic web tasks in SFT and RL.
- Wrap curriculum and curation around all of this.
Takeaways
- Data is a major bottleneck. You need realistic, stable gyms with reliable verifiers. Task design matters more than task quantity.
- Reasoning can't be an afterthought. Reasoning is essential for solving complex web tasks. Mix in reasoning data to maintain general problem-solving. Use verbal feedback so the model learns why it fails.
- Algorithms must handle stability and sample efficiency. Credit assignment, train-inference mismatch, learning from negative examples, and data curation all matter.
- Infrastructure needs to be robust and fast: RL runs should sustain high throughput for days or weeks, recover automatically from failures, and keep metrics trustworthy without constant babysitting.
Scaling RL for computer-use agents is not about one trick. It’s about making every layer of the system scale together: realistic gyms and reliable verifiers, strong reasoning, stable and efficient algorithms, and robust infrastructure. When those layers line up, each additional unit of compute buys you better learning signal, faster iteration, and more capable agents.