Even in a world where foundation models have made the manifold applications of artificial intelligence (AI) seemingly ubiquitous, the recent rise of agentic AI and the resultant proliferation of tool-using agents represent a significant step forward.
Agents are unique in their ability to utilize models along with a wide range of tools in order to take autonomous actions while fulfilling a user request. Those requests range in complexity from performing a web search, to posting multi-step status updates across multiple APIs, to retrieving unstructured information across disparate sources, then generating charts based on that information. “Tools are the way that these models get things done in the real world,” said Michael Giannangeli, technical product manager, Amazon AGI.
Browser controls, mapping tools, code interpreters, web search, and database lookups are all examples of tools; the list is not only massive but also constantly expanding. Selecting from that toolbox is the process known as tool calling (also known as function calling), whereby an agent can interact with a massive set of tools and APIs.
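At its simplest, tool calling means giving a model a set of tool descriptions and letting it pick the one that fits the request. The sketch below is purely illustrative: the tool schemas and the keyword-overlap `select_tool` heuristic are stand-ins for what a trained model does, not any provider's actual API.

```python
# Illustrative tool schemas; real tool definitions also carry
# parameter specifications, not just names and descriptions.
TOOLS = [
    {"name": "web_search", "description": "Search the web for a query"},
    {"name": "code_interpreter", "description": "Run a snippet of code"},
    {"name": "database_lookup", "description": "Query a structured database"},
]

def select_tool(request: str, tools: list[dict]) -> dict:
    """Stand-in for the model's tool-selection step: score each
    tool by keyword overlap with the request and pick the best."""
    def score(tool: dict) -> int:
        words = set(tool["description"].lower().split())
        return len(words & set(request.lower().split()))
    return max(tools, key=score)

choice = select_tool("search the web for today's weather", TOOLS)
print(choice["name"])  # web_search
```

In practice the selection is done by the model itself, conditioned on the full conversation, rather than by any fixed heuristic like this one.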
“It's not as simple as ‘Do I need a hammer or a screwdriver?’” explained Vibhaa Sivaraman, member of technical staff at Amazon’s AGI Lab. “Rather it's often something much more specific like, ‘I have N different tools, which of them should I be using?’”
“It’s as if you have 500 different screwdrivers, then 100 hammers, and then 100 wrenches, and so on — there are both lots of different tools and lots of variations of the types of tools,” Giannangeli added.
Sivaraman, Giannangeli, and their colleagues in the AGI Lab are working to build truly useful AI agents that know when to call on the right tools at the right time to reliably execute complex workflows. That vision began taking shape earlier this year with the release of Amazon Nova Act, which is now available as a service on AWS to build and manage highly reliable agents at scale. Nova Act is powered by a custom Nova 2 Lite model, and includes features like human-in-the-loop oversight and preview capabilities like cutting-edge tool calling.
“Nova’s tool choice allows developers to pick three different settings: tool, any, or auto,” Giannangeli explained. “‘Tool’ means you tell the model which specific tool to use; ‘any’ gives the model a list of options, but it chooses the right tool to use; ‘auto’ tells the model to decide whether a tool is necessary, which tool to use, and whether multiple tools might be needed.”
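The three settings Giannangeli describes can be pictured as three request configurations. The field names and payload shape below are a generic sketch for illustration, not the exact Nova API, and the two tools are hypothetical.

```python
# Hypothetical tool definitions for the example.
tools = [
    {"name": "get_weather", "description": "Look up the current weather"},
    {"name": "search_flights", "description": "Search for flights"},
]

# "tool": force the model to call one specific, named tool.
forced = {"tools": tools, "tool_choice": {"mode": "tool", "name": "get_weather"}}

# "any": the model must call some tool, but picks which one itself.
any_tool = {"tools": tools, "tool_choice": {"mode": "any"}}

# "auto": the model decides whether a tool is needed at all,
# which to use, and whether multiple tools are required.
auto = {"tools": tools, "tool_choice": {"mode": "auto"}}
```

The practical difference is who holds the decision: the developer (tool), the model within a fixed menu (any), or the model end to end (auto).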
Focusing on browsers first
In that world of multiple — and multiplying — tools, why focus on the browser for Nova Act? For starters, applicability: The AGI Lab team set its sights on reliably executing widespread, everyday tasks — date picking, dropdowns, and popups — that had routinely tripped up other agents.
“The philosophy of Nova Act is that there are building blocks that make up how we navigate the web,” said Lori Knapp, product manager, Amazon AGI. “You might have to search, you might have to filter, you might have to navigate a dropdown or a date picker. The focus has been: How do we train the model to get reliably good at those?”
Those more mundane tasks are essential not only to unlocking browser usage but also as a necessary first step in tackling more complex tool-calling cases.
“Our team has come at this from the lens of ‘browser agents unlock many different use cases’,” Sivaraman added. “Not all web pages have APIs and functions that you can just call off the bat to execute a workflow, whereas you can navigate anything on the web if you're able to click and type. From that perspective, the browser itself is already a tool in some sense.
“Once you take that perspective, what we're thinking about is how do you now incorporate third-party tools to best execute the task at hand?”
The team’s initial focus on “simple” tasks has had the added benefit of increasing reliability — Nova Act routinely reached above 90% success rates on early enterprise customer use cases — a metric that many agentic models regularly struggle with.
“Even some of the better agents out there are only 60 to 70% reliable and fail relatively fast on more complex tasks,” Giannangeli observed. “What Nova Act is trying to accomplish is to start out with a narrower set of things to focus on, then working closely with customers to ensure that their workflows are reliable when they leverage Nova Act.”
The role of reasoning
Laddering up from those building blocks to tackle more complex tasks begins, as it does for all AI models, with access to high-quality training recipes.
“We’ve invested in creating better training data and improving our training algorithms for these agentic tool-calling tasks,” said Gradey Wang, member of technical staff, AGI Labs. “These improvements enable a broad range of high-value use cases.”
The team has ensured those training recipe improvements help its models get reliably good at the basics. Moving beyond those basics entails model reasoning, an idea which the researchers cautioned is still in a relatively early phase.
“Model reasoning in the context of tool use looks like this: I have a suite of tools and the model needs to be trained to identify which tools are relevant to the task, then potentially think about tradeoffs,” Sivaraman explained. “We want it to come back with the right tool intelligently and efficiently. It's not just a question of which tool to use, but also how to use that tool.”
She offered the example of performing a web search with the intent of utilizing the results to help complete a workflow.
“If your workflow involves using a webpage’s search results, the browser is the right tool,” she explained. Alternatively, if the desired result of that workflow is a set of ranked search results and the model is provided a search tool, then the model would reason to use that tool.
“We want to teach the model: ‘If I have a task that tells me to search for something and I have a tool that will give me ranked results associated with that thing, I should just call that tool instead of going to a webpage and executing that same search.’ It's both being smarter about the capabilities the model already has and executing them better.”
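The decision Sivaraman describes can be sketched as a simple dispatch: given a search-style task and a tool that returns ranked results, call the tool directly rather than driving the browser. Both "tools" below are hypothetical stubs invented for illustration.

```python
def search_tool(query: str) -> list[str]:
    """Hypothetical API that returns ranked results directly."""
    return [f"ranked result for '{query}' #{i}" for i in range(1, 4)]

def browser_search(query: str) -> list[str]:
    """Fallback: navigate a search page in the browser (stubbed)."""
    return [f"scraped result for '{query}'"]

def run_task(task: str, available_tools: dict) -> list[str]:
    # The reasoning step: recognize a search task, notice a
    # dedicated search tool is available, and prefer it over
    # executing the same search through a webpage.
    if "search" in task.lower() and "search_tool" in available_tools:
        return available_tools["search_tool"](task)
    return available_tools["browser"](task)

results = run_task("search hiking boots",
                   {"search_tool": search_tool, "browser": browser_search})
```

The hard part, of course, is that a trained model has to make this judgment from natural language and tool descriptions alone, not from a hand-written `if` statement.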
A future of increasing complexity
As the worlds of available tools and, subsequently, tool calling expand, the need for improved reasoning will grow along with them.
“There's much more fine-grained reasoning that needs to happen in order to be precise about what you end up calling,” Sivaraman noted. “In the web search example, the model needs to recognize that the task at hand is a search task, that the tools it has include a search tool, and that the search tool is better than going through a web page. That’s three levels of reasoning just to establish that is what you want to do.”
That opens up an exciting new world of challenges to solve.
“How do you start training models to be intelligent about trade-offs and capture all of these optimization functions in a single reward function that balances latency and accuracy and cost?” Sivaraman asked. “You can't just throw fine-tuning data at that. We have built a training paradigm that captures reward associated with the tools it calls and doesn't call. I'm excited to figure out how to further improve our approach.”
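A toy version of the single reward function Sivaraman mentions might combine accuracy, latency, and cost into one scalar. The weights and the linear form below are illustrative assumptions, not the team's actual training objective.

```python
def reward(accuracy: float, latency_s: float, cost_usd: float,
           w_acc: float = 1.0, w_lat: float = 0.2, w_cost: float = 0.5) -> float:
    """Higher is better: reward accuracy, penalize latency and cost.
    The weights encode the tradeoff between the three objectives."""
    return w_acc * accuracy - w_lat * latency_s - w_cost * cost_usd

# Under this tradeoff, a fast, cheap, slightly less accurate tool
# call can score higher than a slower, pricier, more accurate one.
fast = reward(accuracy=0.90, latency_s=1.0, cost_usd=0.01)
slow = reward(accuracy=0.95, latency_s=6.0, cost_usd=0.05)
```

Real systems face the harder problem the quote points at: learning such tradeoffs from outcomes rather than hand-picking the weights.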
The relatively recent rise of the model context protocol (MCP) — an open protocol that assists integration between LLM applications and external data sources and tools — is another exciting avenue.
“We're still really early in terms of what's the right way to give access to tools,” Knapp observed. “MCP is a big one right now, but Anthropic just came out with this idea of skills. That tells me this is not a soft space. If you think about it, we can do all these things with natural language, and yet with tools we're kind of using almost API structure, but maybe even that is not the ultimate solution. There's a lot of invention and interesting work to be done there around what is the end state of tools.”