Today we're announcing the open-source release of the first two primitives for our perception agent harness: annotation and verification. Annotation lets you point out needed changes to a workflow instead of having to describe these changes through text. Verification lets the agent check its own work against what was intended. Together, they're new multimodal interaction patterns for improved human-to-agent collaboration.
Perception agents are AI agents that can see and interpret visual interfaces, such as screens or web pages, the way a human can, and then act on what's rendered there. They don't just read code or parse text; they perceive visual output with enough precision to do something useful with that information. A perception agent can tell that a button is two pixels off-center, that a border-radius doesn't match the spec, or that a navigation flow breaks on the third click because a state transition failed silently.
A perception agent that can see is useful. But interaction patterns that let the agent act on what it perceives, and let you communicate back in real time, are essential. As part of the harness, we've designed primitives that turn perception capabilities into better human-to-agent collaboration.
Annotation as a primitive
When you and an agent can see the same screen, you should be able to point, draw, and describe what you see in the same way you would with another human. Instead, today we rely on typing a paragraph to explain what we're looking at.
The annotation primitive introduces a natural interaction pattern to improve output accuracy. On a web page, you click an element and the tool captures its document object model (DOM) selector, bounding box, and computed styles. On visual surfaces like documents and diagrams, you draw directly: circle something to tell the agent "look at this," cross it out to say "remove this," or draw an arrow to say "move this here." You communicate exactly what you're looking at.
How to use it: You annotate using the browser extension or by invoking the Nova Act Annotator skill directly. Open the extension while viewing the page, choose your mode (draw, element, or point), and mark up what needs to change. The annotation is saved as a structured artifact for agent input. The agent receives your feedback in a format it can act on precisely.
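To make "structured artifact" concrete, here is a minimal sketch of what such a payload could look like. It is not the Nova Act Annotator's actual format: every class name, field name, and the gesture-to-intent mapping below are assumptions for illustration, built only from the pieces described above (selector, bounding box, computed styles, and the circle, cross-out, and arrow gestures).

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical mapping from drawn gestures to the intents described in the post.
GESTURE_INTENTS = {
    "circle": "look_at",
    "cross_out": "remove",
    "arrow": "move",
}


@dataclass
class BoundingBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class ElementAnnotation:
    """Element mode: what was clicked, plus the reviewer's note (assumed fields)."""
    selector: str
    bounding_box: BoundingBox
    computed_styles: dict
    note: str = ""


@dataclass
class DrawAnnotation:
    """Draw mode: a gesture over a region of the surface (assumed fields)."""
    gesture: str
    region: BoundingBox
    note: str = ""

    @property
    def intent(self) -> str:
        # Raises KeyError for a gesture that is not in the mapping.
        return GESTURE_INTENTS[self.gesture]


def to_artifact(annotations) -> str:
    """Serialize annotations to JSON that an agent could consume as input."""
    items = []
    for annotation in annotations:
        item = asdict(annotation)  # nested dataclasses become nested dicts
        item["kind"] = type(annotation).__name__
        if isinstance(annotation, DrawAnnotation):
            item["intent"] = annotation.intent
        items.append(item)
    return json.dumps({"annotations": items}, indent=2, sort_keys=True)


artifact = to_artifact([
    ElementAnnotation(
        selector="#checkout > button.primary",
        bounding_box=BoundingBox(312, 540, 120, 40),
        computed_styles={"border-radius": "2px"},
        note="Radius should match the spec",
    ),
    DrawAnnotation("cross_out", BoundingBox(0, 0, 200, 60), "Remove this banner"),
])
```

The point of the sketch is that the agent receives a selector and a region rather than a paragraph of description, so there is nothing left to interpret about which element the feedback refers to.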
Collaborative by design: Annotation isn't limited to a 1:1 loop between you and your agent. Anyone can annotate: a designer can circle a layout issue, or a quality assurance engineer can capture a broken flow. You can send that structured feedback to whoever is iterating, whether that's an agent or another developer on the team. You no longer need to translate your feedback into messages or tickets that lose context. The feedback itself becomes the context.
Nova Act Annotator (Skill + Chrome Extension) — open source today.
Verification as a primitive
Generation and validation have long been treated as separate concerns: Software is built and then tested. This separation made sense when building was slow and expensive.
But with large language models, generation is now nearly instant. Validation, however, is still entirely manual and slow: you have to confirm that the output actually matches intent. This is the vibe coding paradox: you get a full application in 60 seconds and spend the next three hours checking if it works.
Shared perception, the condition in which a human and an agent observe the same visual output and can each reason about its contents, collapses this separation. Generation and validation become one continuous loop, with no repeated manual intervention between steps. The agent can see what it built, just as a human reviewing its work would: it can perceive that a button is misaligned or that the layout collapses on mobile.
How it works: The agent invokes the Nova Act Visual Verifier skill after code generation. The skill spins up the rendered application and runs verification flows automatically. Deterministic checks run first: they read computed Cascading Style Sheets (CSS) values directly from the DOM and catch visual deviations immediately, without involving artificial intelligence. Behavioral checks follow: the agent walks user flows end-to-end, interacting with the application the way a human tester would, to catch functional regressions.
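The deterministic step can be sketched as a plain comparison between expected and computed style values. This is an illustration under stated assumptions, not the Visual Verifier's implementation: the function names, the `Deviation` type, and the input shape (a mapping from selector to property values, as you might collect in a browser with `getComputedStyle`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

StyleMap = Dict[str, Dict[str, str]]  # selector -> {css property: value}


@dataclass(frozen=True)
class Deviation:
    selector: str
    prop: str
    expected: str
    actual: Optional[str]  # None when the element or property was not found


def _normalize(value: str) -> str:
    # Compare values case-insensitively and ignore extra whitespace.
    return " ".join(value.strip().lower().split())


def check_styles(expected: StyleMap, computed: StyleMap) -> List[Deviation]:
    """Return every expected property whose computed value differs or is missing."""
    deviations = []
    for selector, props in expected.items():
        actual_props = computed.get(selector, {})
        for prop, want in props.items():
            got = actual_props.get(prop)
            if got is None or _normalize(got) != _normalize(want):
                deviations.append(Deviation(selector, prop, want, got))
    return deviations


# Assumed example data: a spec and the values read from the rendered page.
spec = {"button.primary": {"border-radius": "8px", "margin-left": "16px"}}
rendered = {"button.primary": {"border-radius": "6px", "margin-left": "16px"}}

deviations = check_styles(spec, rendered)
```

Because a check like this is an exact comparison, it is fast and repeatable, which is one reason to run it before the slower behavioral flows.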
This combination of generation and verification helped us build higher-accuracy web apps for our own use. In fact, we used our own tooling to build the annotation extension and other internal tools.
Nova Act Visual Verifier Skill — open source today.
Build with us
Annotation and verification are two primitives our team at the Amazon AGI Labs has found useful. We're building the perception agent harness in the open because these interaction patterns get better with more people using them, breaking them, and building on top of them.
Try them today and tell us what's missing. We'd like to build these out with you and decide together which primitives should come next.