Large language models today can solve algebra, pass academic benchmarks, and generate highly structured chain-of-thought explanations. In text-only settings, they often feel startlingly intelligent — methodical, articulate, even strategic. But place those models inside an interactive environment — ask them to click buttons, scroll pages, fill out forms, and submit answers — and their behavior changes. Their careful reasoning falters. They guess where they once deduced. They adhere to templates and produce limited procedural narration: stating what they see and what they will click next, without first forming a structured plan and acting in accordance with it. It’s as if part of their intelligence has quietly gone offline the moment the cursor appears.
This discrepancy reveals a critical limitation: Reasoning ability doesn’t seamlessly carry across modalities. A model that can reason effectively when a question is presented as plain text does not necessarily reason as effectively when the same question is embedded inside an interactive interface. When the problem appears in a text-only prompt, the model’s objective is clear: interpret the question, deliberate, and produce an answer. But when the identical question is rendered inside a webpage — with visual layout, HTML structure, input fields, and the requirement to click or type — the cognitive demands change. The model must parse the UI, decide how to act within it, and manage state transitions, all while preserving the underlying reasoning process. In practice, this shift in modality often disrupts reasoning. Bridging this gap requires more than additional demonstrations. It requires training that explicitly reinforces reasoning under interaction. This is where reasoning reinforcement learning (Reasoning RL) becomes essential.
Reasoning is the stability layer beneath agentic behavior
Reasoning is not an optional enhancement layered onto language models; it is the core capability that enables planning, adaptation, and generalization. Strong reasoning underpins the ability to decompose complex goals into manageable steps, recover from mistakes, adapt to changing interface states, and handle tasks that deviate from familiar templates. Without it, models tend to overfit to narrow benchmarks and surface patterns.
Web environments are interactive, stateful, and often require exploration beyond familiar patterns. Every click reshapes the interface. Every new page updates the underlying state. A misread button or overlooked field can quietly snowball into total task failure. If we expect agents to handle real-world workflows reliably — booking travel, managing dashboards, navigating enterprise tools — their reasoning can’t just survive these dynamics; it has to remain stable and even sharpen under interactive pressure.
Same model, same question — when intelligence fails to transfer
During large-scale pretraining and continued fine-tuning, models are exposed to canonical academic datasets such as GSM8K, MMLU, MMMU, and ChartQA. These datasets equip models with substantial world knowledge and reasoning skills. In text-only evaluations, the results are strong. Present a mathematical equation as plain text, and the model produces a coherent chain-of-thought and a correct solution.
However, when the identical problem is embedded inside an interactive webpage — rendered in HTML, requiring the model to read the question from the page, click into an input box, and type the answer — performance drops sharply. Instead of solving the equation, the agent often generates procedural narration: it acknowledges the presence of an input field and declares its intention to type the answer, but omits meaningful symbolic reasoning.
The knowledge is still encoded in the model’s weights. The reasoning patterns were learned during pretraining, but in agent mode, they fail to activate properly. This isn’t a loss of intelligence, but a failure of transfer across modalities. The shift from text-only prompts to interactive, multimodal environments disrupts the deployment of reasoning capabilities. Understanding and addressing this gap became a focus of our experiments within Amazon AGI.
Diagnosing the modality gap with interactive benchmarks
To systematically investigate the problem, we built agentic versions of academic benchmarks—interactive “reasoning gyms” that wrap canonical datasets inside controlled environments. These gyms preserve the core intellectual challenge of the task while introducing interaction. Questions are rendered in webpages. Answers must be submitted through input fields. The model operates in agent mode rather than pure text generation.
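To make the idea concrete, here is a minimal sketch of what such a "reasoning gym" might look like. The class name `QAGym`, the action strings, and the single-field page template are all illustrative assumptions, not the actual environment code; the point is the shape of the wrapper: a canonical question rendered as a page, an input field, and a submit action that ends the episode.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent sees at each step: the rendered page, plus episode status."""
    page_html: str
    done: bool
    reward: float

class QAGym:
    """Illustrative sketch of a reasoning gym: wraps one benchmark question
    in a webpage with a single input field and a submit button."""

    PAGE = ("<html><body><p id='q'>{q}</p>"
            "<input id='answer'/><button id='submit'>Submit</button></body></html>")

    def __init__(self, question: str, gold_answer: str):
        self.question = question
        self.gold = gold_answer
        self.typed = ""

    def reset(self) -> Observation:
        self.typed = ""
        return Observation(self.PAGE.format(q=self.question), done=False, reward=0.0)

    def step(self, action: str) -> Observation:
        # Two hypothetical action types: type into the box, or click submit.
        if action.startswith("type "):
            self.typed = action[len("type "):]
            return Observation(self.PAGE.format(q=self.question), done=False, reward=0.0)
        if action == "click submit":
            correct = self.typed.strip() == self.gold.strip()
            return Observation("", done=True, reward=1.0 if correct else 0.0)
        # Unrecognized action: episode continues with no reward.
        return Observation(self.PAGE.format(q=self.question), done=False, reward=0.0)
```

The intellectual content of the task is untouched; only the interface changes — which is exactly what lets these environments isolate the modality gap.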
We built environments for multiple classic benchmarks, and each benchmark was evaluated in multiple configurations: traditional text-only zero-shot prompts, few-shot prompts where applicable, fully rendered gym environments, and variants where the question text was explicitly included in the prompt to isolate potential OCR issues.
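The evaluation matrix described above can be sketched as a simple grid. The benchmark and mode names below mirror the configurations in the text; the function and constants are hypothetical scaffolding, not our actual harness.

```python
from itertools import product

# Benchmarks wrapped in gym environments (illustrative subset).
BENCHMARKS = ["MATH", "GSM8K", "MMLU", "ChartQA"]

# Each benchmark is run under several presentation modes to
# localize where the modality gap comes from.
MODES = [
    "text_zero_shot",        # question as a plain-text prompt
    "text_few_shot",         # plus in-context examples, where applicable
    "gym_rendered",          # question rendered inside the interactive webpage
    "gym_with_text_prompt",  # rendered page + question text in the prompt (rules out OCR)
]

def evaluation_grid():
    """Every (benchmark, mode) pair to evaluate."""
    return [(b, m) for b, m in product(BENCHMARKS, MODES)]
```

Comparing `gym_rendered` against `gym_with_text_prompt` is the key control: if accuracy stays low even when the question text is in the prompt, the failure is not a perception (OCR) problem but a reasoning-under-interaction problem.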
The results clearly exposed the modality gap. On MATH, for instance, text-only zero-shot performance substantially exceeded gym performance. On GSM8K, text-only accuracy was high, but performance in the interactive environment collapsed. The model could solve the problems in principle; it simply struggled to reason when required to act within a webpage. This gap suggested that supervised fine-tuning alone was insufficient. We needed a training signal that directly reinforced reasoning under interaction.
Mathematics as a launchpad for agentic intelligence
We began by applying reinforcement learning directly to the training split of MATH gym (agentified questions from the MATH dataset). Instead of teacher-forcing step-by-step solutions, we required the model to generate fully on-policy rollouts — reasoning, acting, observing, and adapting within the live interface. Rewards were issued only when a trajectory ended in a correct submission.
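A rollout under this scheme might look like the sketch below. The `policy` callable and the environment interface are assumptions for illustration; what matters is that every action is sampled from the model's own current policy (no teacher forcing) and that reward arrives only at the terminal submission.

```python
def collect_rollout(policy, env, max_steps=20):
    """On-policy rollout sketch: the model chooses each action itself,
    and the only reward is the terminal correct/incorrect signal."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)              # sample from the *current* policy
        next_obs = env.step(action)
        trajectory.append((obs, action, next_obs.reward))
        obs = next_obs
        if next_obs.done:
            break
    # Outcome-only return: nonzero iff the submitted answer was correct.
    episode_return = sum(r for _, _, r in trajectory)
    return trajectory, episode_return
```

Because the reward is sparse and terminal, the model is never told which intermediate step was wrong — it must learn to produce reasoning that reliably ends in a correct submission.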
The early results were encouraging. After roughly one epoch over just a few thousand questions, the gap between text-only evaluation and gym evaluation shrank drastically. The model learned to parse rendered equations, carry out symbolic reasoning while navigating the page, and type correct answers into the input box. The same reasoning that previously faltered under interaction began to hold steady across perception, action, and state updates.
More surprisingly, the gains were not confined to mathematics. Although the reinforcement learning tasks focused solely on mathematics, improvements carried over to entirely different domains. Performance also improved on agentic MMLU tasks, which assess high school- to college-level world knowledge and reasoning across subjects such as history, economics, law, medicine, biology, physics, and other academic disciplines.
This suggests the improvement wasn’t just domain-specific memorization or narrow skill tuning. The model didn’t just get better at math — it became better at reasoning in interactive settings. By learning to think, act, and adapt coherently within a structured environment, it developed skills that transfer to other tasks requiring state tracking, careful reading, and deliberate decision-making. Sharpening the model on MATH produced meaningful spillover effects, strengthening its agentic abilities across a broad range of domains.
Beyond math: A reasoning curriculum
Training web agents ultimately requires reinforcement learning on full, end-to-end interactive workflows. In these settings, the model is expected to navigate real webpages, fill out forms, apply filters, and scroll through dynamically loaded content. Encouraged by early experiment results and aiming to further advance the model’s agentic intelligence, we decided to introduce a dedicated reasoning RL phase prior to the web-task reinforcement learning stage. In this phase, rather than optimizing over step-level instruction execution, we confined training to domains with precise, automatically verifiable answers. This allowed us to shape the model’s internal reasoning process — problem decomposition, intermediate deduction, and self-verification — without the confounding noise of complex UI interaction. By strengthening this cognitive substrate in isolation, we ensured that subsequent web-task RL would build on a more deliberate and structured reasoning policy.
We expanded the training curriculum to include multiple reasoning domains with reliable verification. In mathematics, we expanded to increasingly difficult competition problems from AMC and AIME to encourage deeper logical deduction and structured skill progression. We added coding tasks from MBPP (training split) to develop procedural and algorithmic reasoning. We also incorporated structured information understanding tasks — including ChartQA, WikiTable extraction, and scientific question answering — that require interpreting tabular and visual inputs. These environments strengthen the model’s capacity for grounded quantitative reasoning, such as extracting key values, comparing magnitudes, and inferring patterns from structured data. They encourage robust grounding by forcing the model to tie its answers directly to structured evidence.
All tasks share a key advantage: clear ground truth and dependable reward signals. Because the correct outcomes are easy to verify and inexpensive to scale, they provide an efficient source of tasks for training and systematic evaluations.
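The verifiers themselves can be very simple, which is what makes these reward signals cheap and dependable. The two functions below are illustrative sketches, not our production checkers: a math verifier that exact-matches after light normalization, and a code verifier that rewards a candidate solution only if every hidden assertion passes.

```python
def verify_math(submitted: str, gold: str) -> float:
    """Math answers: reward exact match after light normalization
    (whitespace and a trailing period; real checkers normalize more)."""
    norm = lambda s: s.strip().replace(" ", "").rstrip(".")
    return 1.0 if norm(submitted) == norm(gold) else 0.0

def verify_code(submitted_src: str, tests: list) -> float:
    """Code answers (MBPP-style): reward 1.0 only if the candidate
    solution runs and every test assertion passes."""
    scope = {}
    try:
        exec(submitted_src, scope)   # run the candidate solution
        for t in tests:
            exec(t, scope)           # each test is an assert statement
        return 1.0
    except Exception:
        return 0.0
```

Note that in practice untrusted generated code would be executed in a sandbox rather than with a bare `exec`; the sketch only shows the reward logic.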
As training progressed, we observed qualitative changes. The model began generating longer, more detailed chains-of-thought. It became more willing to backtrack when intermediate deductions failed. It exhibited stronger schema understanding when parsing tables and improved quantitative interpretation of charts. Importantly, these improvements generalized across domains.
Stability and on-policy training
Maintaining stability during RL was essential. We relied on on-policy training to ensure that reasoning traces reflected the model’s own internal state rather than teacher-forced guidance. At the same time, mechanisms such as KL regularization helped prevent reward collapse and excessive invalid actions. Preserving sufficient entropy during this first reasoning RL stage was critical to maintaining exploration capacity for subsequent large-scale web RL. The outcome was a base policy that not only strengthened core agentic intelligence, but also consistently outperformed pure supervised fine-tuning across real-world web workflows.
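The interplay of these stabilizers can be sketched at the level of a single token's objective. The function below is a simplified illustration, not our training loss; the coefficient values are arbitrary, and real implementations apply clipping, batching, and more careful KL estimators. It shows the three forces the text describes: a policy-gradient term pushing toward rewarded behavior, a KL penalty anchoring the policy to a frozen reference model, and an entropy bonus preserving exploration.

```python
import math

def regularized_token_objective(logp_policy, logp_ref, advantage,
                                kl_coef=0.05, ent_coef=0.01,
                                token_probs=None):
    """One-token sketch of a KL-regularized RL objective (to maximize).

    logp_policy : log-prob of the sampled token under the current policy
    logp_ref    : log-prob of the same token under the frozen reference model
    advantage   : outcome-derived advantage for this trajectory
    token_probs : optional next-token distribution, for the entropy bonus
    """
    pg = logp_policy * advantage
    # Per-token KL estimate toward the reference: penalizes drifting too far,
    # which guards against reward collapse and degenerate action strings.
    kl = logp_policy - logp_ref
    # Entropy bonus keeps the distribution from sharpening prematurely,
    # preserving exploration capacity for the later web RL stage.
    ent = 0.0
    if token_probs:
        ent = -sum(p * math.log(p) for p in token_probs if p > 0)
    return pg - kl_coef * kl + ent_coef * ent
```

When the policy matches the reference (`logp_policy == logp_ref`) the KL term vanishes, so the penalty only activates as the policy drifts during training.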
Structured reasoning in the wild
The benefits of reasoning-focused reinforcement learning extend beyond controlled base intelligence gym environments and become especially evident in real-world web workflows. Consider a multi-step task that involves searching, selecting dates, navigating listings, scrolling, and inspecting a detailed amenities section. In such scenarios, a baseline agent trained directly with RL on web workflow tasks often overfits to superficial chain-of-thought templates rather than developing robust reasoning capabilities. As a result, it may misinterpret the task requirements, prematurely conclude that it has completed the task, or return information without verifying the current page state.
In contrast, an agent trained with reasoning RL demonstrates more deliberate and state-aware behavior. It checks for and handles pop-up windows, reflects on the outcomes of its actions, and explicitly inspects relevant sections of the page before proceeding. Rather than following memorized navigation patterns, it interprets the task requirements and validates that the necessary conditions are met before returning an answer. The key difference is the emergence of structured, context-sensitive reasoning grounded in the current state of the environment.
This contrast becomes even clearer in tasks that require more precise interpretation of page content. For example, in a workflow that involved counting reviews containing a specific term, the baseline agent again exhibited brittle behavior: it scrolled aimlessly, failed to isolate the relevant information, and ultimately terminated with an error. In contrast, the agent trained with reasoning-focused RL approached the task methodically. It recognized when critical information was not immediately visible, navigated deliberately to the appropriate sections, and refined its search. Rather than executing arbitrary action sequences, it formed and tested hypotheses, using intermediate observations to guide subsequent steps. This pattern of deliberate exploration and verification further illustrates how reasoning RL promotes coherent, state-aware problem solving rather than superficial pattern matching.
Reasoning as a reinforced habit
During large-scale pretraining, models internalize latent reasoning patterns across text, code, and instruction data. However, downstream fine-tuning for agent efficiency can suppress these patterns, encouraging shorter, more procedural outputs. Reasoning RL reactivates and amplifies these latent capabilities by rewarding structured, goal-directed reasoning under interaction.
The training loop repeatedly reinforces a pattern: observe the environment, think through its implications, execute a targeted action, verify the result, and adjust if necessary. Over time, this loop becomes internalized. The model stops treating webpages as scripts to execute and begins treating them as environments to reason within.
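The reinforced loop above can be sketched in a few lines. The `model.think`, `model.act`, and `model.verify` methods are hypothetical names standing in for the behaviors the training rewards; the structure — observe, think, act, verify, adjust — is the point.

```python
def reinforced_agent_loop(model, env, max_steps=30):
    """Sketch of the habit the training loop instills: every action is
    preceded by reasoning over the current state and followed by a check
    that the environment responded as expected."""
    obs = env.reset()
    for _ in range(max_steps):
        plan = model.think(obs)            # reason over the current page state
        action = model.act(obs, plan)      # one targeted action from the plan
        new_obs = env.step(action)
        if not model.verify(new_obs, plan):
            # The page did not change as expected: adjust by re-planning
            # from the observed state instead of replaying the old script.
            plan = model.think(new_obs)
        obs = new_obs
        if obs.done:
            break
    return obs
```

The contrast with a template-following agent is the `verify` step: instead of assuming each action succeeded, the model grounds its next decision in what the environment actually did.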
Reliable agents start with portable reasoning
The key lesson is that modality alignment depends on deliberately strengthening reasoning capabilities. Before scaling complex web RL on open-ended tasks, it is beneficial to first reinforce the model’s reasoning substrate in controlled, verifiable domains. A structured curriculum of reasoning gyms enables reliable transfer across modalities, restores suppressed capabilities, and promotes cross-domain generalization — ultimately producing a more stable and intelligent base policy for subsequent web-scale training.
Reliable agents do not emerge solely from larger models or more trajectories. They emerge when reasoning is deliberately strengthened as a fundamental skill for web interactions. If we want agents that can plan, recover, adapt, and generalize in real-world workflows, we must train them not just to act—but to reason while acting. Reasoning RL is not an auxiliary optimization: it is a foundational step toward agentic intelligence that is coherent, transferable, and robust.