Scaling agent reasoning with variations: Coherence in trajectory rollouts

Web agents that help customers book travel, shop online, and manage tasks still struggle with reliability — misclicking items, misreading intent, and drifting mid-task. The gap isn't just data; it's how agents learn to reason. Limited high-quality training data is part of the problem, but the deeper issue is what that data fails to represent: agents are rarely trained on structured supervision that shows how reasoning unfolds in context over time.

More data is necessary, but not sufficient. Robust, goal-directed web agents need training signals that reflect temporal reasoning: how observations shape intermediate beliefs, how interface changes update plans, and how actions accumulate into progress toward a goal. Effective trajectories capture the interplay between what an agent sees, the thoughts it forms, and the decisions it makes while navigating a stateful, interactive web environment.

Rethinking how web agents learn: Beyond static views

A web agent’s world is visual, interactive, and sequential: each step depends on what came before and changes what comes next. Yet many agents are trained using static text, human-written chain-of-thought, or demonstrations that primarily encode the final steps of a task. These approaches enable language skills and basic action execution, but they miss key ingredients: the agent’s own evolving internal state, the causal dynamics of user interface (UI) changes, and the iterative reasoning required for long-horizon success.

This motivates scaling reasoning, not only data. Here, reasoning is the mechanism that connects perception to action across time: interpreting the screen, forming hypotheses, updating beliefs as the UI changes, and selecting actions that advance the task. Scaling reasoning is therefore about scaling structure: screenshots (observations), intermediate thoughts (state), and actions (decisions) across long trajectories.

Example task: On <user specified hotel provider’s website> search for hotels in “Westby, Wisconsin.”

Example step: think(“The destination field is now populated… Therefore, my last action was successful. I see a suggestion dropdown… I should click the first result to confirm the destination…”)
agentClick("<bbox coordinates>");
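
To make that structure concrete, the sketch below shows one way such a step could be represented. This is a minimal illustration in Python; the class and field names are assumptions for exposition, not our production schema.

from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryStep:
    """One step of a web-agent trajectory: what the agent saw (observation),
    what it thought (state), and what it did (decision)."""
    screenshot: bytes   # rendered page at this step
    thought: str        # intermediate reasoning, e.g. the think(...) text above
    action: str         # UI action, e.g. 'agentClick("<bbox coordinates>")'

@dataclass
class Trajectory:
    """A full rollout: the task instruction plus an ordered sequence of steps."""
    task: str
    steps: List[TrajectoryStep]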

Agents shouldn’t learn by memorizing the “right click.” They should learn why a click is justified given the visual state and prior context. Like learning to drive, competence comes from experience with feedback: seeing how each choice changes what happens next. Agents need rollouts — structured experience — to internalize reasoning patterns beyond pure supervised imitation.

The challenge: Experience without full online exploration

The ideal would be to let agents explore freely in live environments, learning from trial and error. But this approach faces practical barriers: live exploration is expensive, can break real systems, and is challenging to supervise at scale. This creates a tension — agents need exploratory experience to develop robust reasoning, but we can't yet simply unleash them on the web to learn.

Human-collected screenshot-action trajectories help, yet they’re limited: they capture a single reasoning path and intent, not the iterative process by which an agent adapts, corrects mistakes, and refines its plan.

This leads to a central question for agent intelligence: How can we enable models to refine their reasoning while staying grounded in real interactions, without requiring full online exploration?

High-quality human trajectories — screenshots, actions, and step-level reasoning — can serve as scaffolds. The next step is to synthesize, from these fixed environments, new training trajectories that exhibit richer, more exploratory reasoning while remaining coherent.

Structured variations: Generating coherent rollouts

We introduce structured variations, now deployed in our production systems: iterative cycles where the model generates and refines its own reasoning across multiple candidate rollouts, using rejection sampling to retain only trajectories that remain coherent and successfully complete the task. Instead of scaling human annotation, we let the model practice at scale while enforcing coherence and correctness.

Each rollout starts from a human-annotated trajectory containing screenshots and actions. The model replays the same UI actions but replaces the original chain-of-thought with newly generated reasoning. At each step, conditioned on the current screenshot and its own prior reasoning history, the model explores multiple candidate “thought continuations” and selects one that stays consistent with the trajectory.
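
As a rough sketch of this per-step loop (reusing the TrajectoryStep and Trajectory classes sketched earlier): the model interface (generate_thought, is_consistent), the candidate count, and the selection rule below are assumptions for illustration, not our production implementation.

def generate_variation_rollout(human_traj, model, num_candidates=4):
    """Replay a human trajectory's fixed UI actions while regenerating the
    chain-of-thought on-policy, one step at a time. Returns a new Trajectory,
    or None when no coherent continuation is found (rejection sampling)."""
    reasoning_history = []   # the model's OWN prior thoughts, not human text
    new_steps = []
    for step in human_traj.steps:
        # Sample several candidate "thought continuations", conditioned on the
        # current screenshot and the previously generated reasoning.
        candidates = [
            model.generate_thought(               # hypothetical model interface
                task=human_traj.task,
                screenshot=step.screenshot,
                history=reasoning_history,
            )
            for _ in range(num_candidates)
        ]
        # Keep only thoughts that stay consistent with the fixed, replayed action.
        consistent = [t for t in candidates
                      if model.is_consistent(t, step.action)]  # hypothetical check
        if not consistent:
            return None      # discard the rollout rather than accept incoherence
        chosen = consistent[0]
        reasoning_history.append(chosen)
        new_steps.append(TrajectoryStep(step.screenshot, chosen, step.action))
    return Trajectory(task=human_traj.task, steps=new_steps)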

This matters because the same action can be supported by multiple coherent internal explanations. One rollout might justify a click as part of a long-term plan, another might frame it as verifying a state change, and a third might treat it as a correction to earlier uncertainty. By keeping actions fixed while varying reasoning, the model learns that multiple valid reasoning paths can underwrite the same successful behavior.

These variations provide rich supervision about reasoning dynamics while staying grounded in real interface states. A crucial detail is that each step’s context comes from the model’s own previously generated reasoning — not from human-provided text. This keeps the process on-policy, rooted in the model’s internal state, and supports temporal coherence across long rollouts.

Teaching models to generalize, not memorize

Humans often learn by exploring multiple ways to reach the same outcome. Variations provide that same advantage for agents:

  • Flexible reasoning: Models learn diverse, valid reasoning strategies rather than a single scripted trace.
  • Efficient scaling: Rollouts multiply training examples without multiplying human labor.
  • Greater robustness: Exposure to reasoning diversity improves stability under edge cases and UI shifts.

Our hypothesis is that structured variations improve consistency, adaptability, and depth of reasoning. Seeing multiple coherent reasoning paths for the same task helps the model form stronger internal representations of task structure. Planning is the connective tissue of reasoning: breaking goals into coherent intermediate steps that build on prior context. Effective planning depends on interpreting what has already been seen and done to decide what to do next. Through iterative variation, the model develops strategic awareness: understanding why an action matters, how to correct mistakes, and when to replan as conditions change.

Awakening latent reasoning through human exemplars

To push reasoning depth further, we seed the variation-driven pipeline with a curated set of high-reasoning human trajectories. These exemplars reflect the reasoning style we want: long-horizon, reflective, explicit about assumptions, willing to explore, and capable of self-correction. Annotators narrate genuine thought processes — including small mistakes and course corrections — while solving complex, ambiguous tasks.

These exemplars “wake up” latent reasoning. During large-scale pretraining, models absorb reasoning patterns across text, code, and instruction data, building strong latent capabilities. But in downstream agent fine-tuning, where the objective shifts to efficient task completion, reflective reasoning can be suppressed. The model learns to execute actions efficiently but may stop explaining or reflecting on why those actions make sense. High-reasoning exemplars reintroduce those strategies. Once reactivated, the model can generalize and amplify them at scale through structured variations — without requiring additional annotation.

Consider a real task from our evaluation set:

  • Task: Initiate <user-specified grocery store> cart checkout (remove one Ocean Spray juice and set delivery to fast)
  • Original (cached CoT): “I am on the home page… I need to click <store name>…”
  • Variation (model-generated): Observes homepage state, identifies multiple store options, restates goal constraints, and explains why selecting the shopping cart for the specified store is necessary before modifying items and delivery speed.

The improvement is grounding and coherence: observations linked to intentions and intentions linked to actions.

These structured variations fundamentally change what agents learn — not just which actions to take, but how to reason through complex, ambiguous situations. To better measure the model’s end-to-end potential, we need to rethink evaluation too, with metrics that assess whether agents can sustain coherent, self-directed reasoning across an entire task.

Figure: On-policy evaluations

Measuring what matters: On-policy evaluation powered by variations

Traditional evaluation often relies on cached, teacher-forced intermediate reasoning from demonstrations, where human-provided reasoning may not be representative of a model’s internal thought process. We update evaluation to remain on-policy: the model’s next step depends on its own prior thoughts.

We measure Pass@1 and Pass@k at the step and trajectory levels, capturing action accuracy and end-to-end completion across k attempts. When a generated action matches the ground-truth action, it becomes part of the ongoing context, so evaluation reflects coherent self-generated reasoning rather than teacher-forced fallback guidance. This makes the metrics more representative of deployment behavior: long-horizon, stateful, and vulnerable to compounding errors.
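
A simplified sketch of this on-policy evaluation loop appears below. The agent interface, the action matcher, the decision to end an attempt on the first mismatch, and the exact aggregation into step- and trajectory-level Pass@k are assumptions for illustration.

def actions_match(predicted, ground_truth):
    """Hypothetical matcher; in practice this would compare action types and
    normalized arguments (e.g., target elements), not raw strings."""
    return predicted.strip() == ground_truth.strip()

def evaluate_on_policy(task, reference_traj, agent, k=1):
    """Run up to k independent attempts at a trajectory. Within an attempt the
    agent's own thoughts and actions form the context for later steps; a step
    counts as correct when its action matches the ground truth."""
    best_step_accuracy = 0.0
    trajectory_pass = False
    for _ in range(k):
        context = []        # self-generated (thought, action) history
        correct = 0
        for step in reference_traj.steps:
            thought, action = agent.act(           # hypothetical agent interface
                task=task, screenshot=step.screenshot, history=context
            )
            if not actions_match(action, step.action):
                break       # assumption: a mismatch ends the attempt
            correct += 1
            # The matching step stays in context, so the next step is
            # conditioned on self-generated reasoning rather than cached CoT.
            context.append((thought, action))
        best_step_accuracy = max(best_step_accuracy,
                                 correct / len(reference_traj.steps))
        trajectory_pass = trajectory_pass or correct == len(reference_traj.steps)
    # Step-level Pass@k ~ best per-step accuracy; trajectory-level Pass@k ~ any full success.
    return best_step_accuracy, trajectory_pass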

Path toward reliable agents

The structured variations pipeline represents a fundamental shift in how we develop agent intelligence. Rather than passively imitating demonstrations, our models become active participants in their own learning, generating and refining reasoning traces grounded in real interface states and verified outcomes. This approach delivers measurable improvements in agent reliability:

  • More coherent, temporally consistent trajectories that maintain context across long-horizon tasks
  • Richer, model-aligned reasoning traces generated on-policy rather than teacher-forced from human annotations
  • Stronger generalization through exposure to diverse reasoning paths that promote adaptability and self-correction

By continuously producing structured variations, the model learns not just to execute instructions, but to make sense of them — planning, backtracking, and reflecting under ambiguity. Ultimately, this is a step toward agents that don’t merely follow rigid, step-by-step procedural guidance, but reason thoroughly with increasing autonomy and reliability.
