Browser Actions
Browser actions are the building blocks for web automation.
Overview
Available actions:
- Navigation
- Click interactions
- Text input
- Content extraction
- Screenshots
- Scrolling
- Waiting
Accessibility Tree Snapshots
Instead of relying on screenshots as the primary page representation, the browser now uses text-based accessibility tree snapshots. This approach is inspired by OpenClaw and provides a structured, token-efficient view of the page that the agent can reason about directly.
How It Works
When a snapshot is taken, the browser injects JavaScript that walks the DOM, assigns data-ref="N" attributes to interactive elements, and returns a structured text representation of the page.
The format uses two styles:
- Interactive elements: [N] role "label" attrs (N is the ref number used for targeting)
- Landmarks and headings: - role "label" (structural elements that provide context)
Example snapshot output:
- heading "Search Results"
[1] link "Home"
[2] textbox "Search..." focused
[3] button "Search"
- main "Results"
[4] link "First Result - Example Page"
[5] link "Second Result - Another Page"
[6] button "Next Page"
Page text:
First Result - Example Page
A description of the first result found...
Second Result - Another Page
More details about the second result...
The snapshot also includes a page text section at the bottom, which captures readable text from the page (search results, articles, etc.).
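To make the two line styles concrete, here is a minimal Python sketch that parses a snapshot into records. It is an illustration only, not the product's parser: the function name `parse_snapshot`, the record layout, and the assumption that the text section always begins with a `Page text:` line are inferred from the example above.

```python
import re

# Interactive line: [N] role "label" attrs   (attrs are optional)
INTERACTIVE = re.compile(r'^\[(\d+)\]\s+(\S+)\s+"(.*)"\s*(.*)$')
# Structural line: - role "label"
STRUCTURAL = re.compile(r'^-\s+(\S+)\s+"(.*)"$')


def parse_snapshot(text):
    """Split a snapshot into element records and the trailing page text."""
    elements, page_text = [], []
    in_page_text = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_page_text:
            # Everything after the "Page text:" marker is plain text.
            page_text.append(line)
        elif line == "Page text:":
            in_page_text = True
        elif (m := INTERACTIVE.match(line)):
            ref, role, label, attrs = m.groups()
            elements.append({"ref": int(ref), "role": role,
                             "label": label, "attrs": attrs.split()})
        elif (m := STRUCTURAL.match(line)):
            elements.append({"ref": None, "role": m.group(1),
                             "label": m.group(2), "attrs": []})
    return elements, page_text
```

Interactive records carry an integer `ref`; landmarks and headings carry `ref: None`, so a caller can filter for targetable elements with a simple check.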
Getting a Snapshot
Use the snapshot action to get or refresh the accessibility tree:
json
{
"action": "snapshot"
}
All other actions (click, type, press, scroll) automatically return a fresh snapshot in their response, so you always have an up-to-date view of the page after every interaction.
Ref-Based Targeting
Instead of writing CSS selectors, you can target elements by their ref number from the snapshot. The ref number resolves to a [data-ref="N"] CSS selector internally.
json
{
"action": "click",
"ref": 5
}
This is equivalent to { "action": "click", "selector": "[data-ref='5']" } but is shorter and less error-prone. Ref numbers are reassigned on each snapshot, so always use refs from the most recent snapshot.
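The ref-to-selector mapping described above can be sketched in a few lines of Python. The helper names `ref_to_selector` and `normalize_target` are hypothetical, and rejecting negative or non-integer refs is an assumption; the real resolution happens inside the browser layer.

```python
def ref_to_selector(ref):
    """Build the CSS selector that a ref number is described as resolving to."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ref, bool) or not isinstance(ref, int) or ref < 0:
        raise ValueError(f"ref must be a non-negative integer, got {ref!r}")
    return f'[data-ref="{ref}"]'


def normalize_target(action):
    """Return a copy of an action with 'ref' rewritten to 'selector'."""
    if "ref" not in action:
        return dict(action)
    normalized = {k: v for k, v in action.items() if k != "ref"}
    normalized["selector"] = ref_to_selector(action["ref"])
    return normalized
```

For example, `normalize_target({"action": "click", "ref": 5})` yields an action whose `selector` is `[data-ref="5"]`.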
You can use ref-based targeting with click, type, and other element-targeting actions:
json
{
"action": "type",
"ref": 2,
"text": "search query"
}
Token Efficiency
Accessibility tree snapshots are far smaller than screenshots:
| Representation | Typical Size |
|---|---|
| Screenshot (PNG) | ~5 MB |
| Snapshot (text) | ~5-50 KB |
| Reduction | ~100-1000x |
This makes it practical to include page state in every tool response without blowing up context windows.
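As a quick check of the arithmetic, the sizes in the table imply a reduction of about 100x for the largest snapshots and more for smaller ones. The sketch below uses only the approximate figures quoted above (with decimal units assumed); `reduction_factor` is a name chosen for illustration.

```python
def reduction_factor(screenshot_bytes, snapshot_bytes):
    """How many times smaller the snapshot is than the screenshot."""
    if snapshot_bytes <= 0:
        raise ValueError("snapshot_bytes must be positive")
    return screenshot_bytes / snapshot_bytes


SCREENSHOT = 5_000_000      # ~5 MB, from the table
SNAPSHOT_SMALL = 5_000      # ~5 KB
SNAPSHOT_LARGE = 50_000     # ~50 KB

print(reduction_factor(SCREENSHOT, SNAPSHOT_LARGE))  # 100.0
print(reduction_factor(SCREENSHOT, SNAPSHOT_SMALL))  # 1000.0
```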
Vision Model Fallback
For pages where the accessibility tree does not capture sufficient information (e.g., canvas-based applications, complex visual layouts, or CAPTCHA challenges), the agent can fall back to screenshot-based analysis using a vision model. The vision model is configurable via settings.llm.visionModel and follows an auto-fallback chain: configured vision model, then the main LLM if it supports vision, then Llama 4 Scout as a last resort.
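The fallback chain reads as a simple ordered choice. The following Python sketch shows that order under stated assumptions: the function name, its parameters, and the way vision support is passed in as a flag are all hypothetical, and the last-resort string is the display name from the text rather than a confirmed model identifier.

```python
LAST_RESORT_MODEL = "Llama 4 Scout"  # name from the text; the exact model ID is an assumption


def pick_vision_model(configured_vision_model, main_model, main_supports_vision):
    """Walk the fallback chain: configured model, then main LLM, then last resort."""
    if configured_vision_model:
        # settings.llm.visionModel is set, so it wins.
        return configured_vision_model
    if main_model and main_supports_vision:
        return main_model
    return LAST_RESORT_MODEL
```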
Action Types
Navigate
Navigate to a URL:
json
{
"action": "navigate",
"url": "https://example.com"
}
Click
Click an element:
json
{
"action": "click",
"selector": "#submit-button"
}
Options:
json
{
"action": "click",
"selector": ".menu-item",
"button": "left",
"click_count": 2,
"delay": 100
}
Type
Enter text:
json
{
"action": "type",
"selector": "#email-input",
"text": "user@example.com"
}
With options:
json
{
"action": "type",
"selector": "#search",
"text": "search query",
"delay": 50,
"clear": true
}
Press
Press keyboard keys:
json
{
"action": "press",
"key": "Enter"
}
Common keys:
- Enter, Tab, Escape
- ArrowUp, ArrowDown, ArrowLeft, ArrowRight
- Backspace, Delete
- Modifiers: Shift+Enter, Control+a
Extract
Extract content from page:
json
{
"action": "extract",
"selector": ".product-price",
"attribute": "text"
}
Extract types:
json
{
"action": "extract",
"selector": "img.hero",
"attribute": "src"
}
Multiple elements:
json
{
"action": "extract",
"selector": ".item-title",
"multiple": true
}
Screenshot
Capture screenshot:
json
{
"action": "screenshot"
}
With options:
json
{
"action": "screenshot",
"selector": "#main-content",
"full_page": false,
"format": "png"
}
Scroll
Scroll the page:
json
{
"action": "scroll",
"direction": "down",
"amount": 500
}
Scroll to element:
json
{
"action": "scroll",
"selector": "#footer",
"behavior": "smooth"
}
Wait
Wait for conditions:
json
{
"action": "wait",
"selector": ".loading",
"state": "hidden"
}
Wait types:
json
// Wait for element
{
"action": "wait",
"selector": "#content",
"state": "visible",
"timeout": 10000
}
// Wait for time
{
"action": "wait",
"time": 2000
}
// Wait for navigation
{
"action": "wait",
"type": "navigation"
}
Executing Actions
Via API
bash
curl -X POST https://your-domain.com/api/agents/{id}/browser/{sessionId}/execute \
-H "Content-Type: application/json" \
-d '{
"action": "click",
"selector": "#login-button"
}'
Response:
json
{
"success": true,
"result": null,
"duration_ms": 150
}
Via Workflow
json
{
"nodes": [
{
"id": "nav-1",
"type": "navigate",
"data": { "url": "https://example.com" }
},
{
"id": "click-1",
"type": "click",
"data": { "selector": "#login" }
},
{
"id": "type-1",
"type": "type",
"data": {
"selector": "#email",
"text": "{{credentials.email}}"
}
}
]
}
Selectors
CSS Selectors
json
// By ID
{ "selector": "#my-element" }
// By class
{ "selector": ".my-class" }
// By attribute
{ "selector": "[data-testid='submit']" }
// By tag
{ "selector": "button" }
// Combined
{ "selector": "form#login input[type='email']" }
XPath Selectors
json
{
"selector": "//button[contains(text(), 'Submit')]",
"selector_type": "xpath"
}
Text Selectors
json
{
"selector": "text=Click here",
"selector_type": "text"
}
Action Chaining
Execute multiple actions:
bash
curl -X POST .../browser/{sessionId}/execute \
-H "Content-Type: application/json" \
-d '{
"actions": [
{ "action": "navigate", "url": "https://example.com" },
{ "action": "wait", "selector": "#content" },
{ "action": "click", "selector": "#menu" },
{ "action": "screenshot" }
]
}'
Error Handling
Common Errors
| Error | Cause | Solution |
|---|---|---|
| ElementNotFound | Selector doesn't match | Verify selector, wait for element |
| Timeout | Action took too long | Increase timeout |
| NavigationFailed | Page failed to load | Check URL, retry |
| ClickIntercepted | Element blocked | Wait, scroll into view |
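Callers that want to handle these errors on the client side could wrap their calls in a small retry loop, sketched here in Python. Several things are assumptions rather than documented behavior: `execute` is a hypothetical callable that sends one action and returns the parsed response, a failed response is assumed to carry `"success": false` with the error name in an `"error"` field, and treating all four listed errors as retryable is a judgment call.

```python
import time

# Error names from the table above; assumed to appear in an "error" field.
RETRYABLE = {"ElementNotFound", "Timeout", "NavigationFailed", "ClickIntercepted"}


def execute_with_retry(execute, action, attempts=3, delay_ms=1000, sleep=time.sleep):
    """Call execute(action) until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response = None
    for attempt in range(1, attempts + 1):
        response = execute(action)
        if response.get("success"):
            return response
        if response.get("error") not in RETRYABLE:
            # Unknown errors are returned immediately rather than retried.
            return response
        if attempt < attempts:
            sleep(delay_ms / 1000)
    return response
```

The `sleep` parameter exists so that tests can pass a no-op instead of actually waiting between attempts.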
Retry Logic
json
{
"action": "click",
"selector": "#button",
"retry": {
"attempts": 3,
"delay": 1000
}
}
Advanced Actions
Hover
json
{
"action": "hover",
"selector": ".dropdown-trigger"
}
Select
Select dropdown option:
json
{
"action": "select",
"selector": "#country",
"value": "US"
}
Upload
Upload file:
json
{
"action": "upload",
"selector": "input[type='file']",
"file_path": "/path/to/file.pdf"
}
Evaluate
Execute JavaScript:
json
{
"action": "evaluate",
"script": "document.querySelector('#count').textContent"
}
Best Practices
1. Use Stable Selectors
Prefer:
- IDs and data attributes
- Semantic selectors
Avoid:
- Dynamic classes
- Position-based selectors
2. Wait for Elements
Always wait before interacting:
json
[
{ "action": "wait", "selector": "#form" },
{ "action": "type", "selector": "#email", "text": "..." }
]
3. Handle Dynamic Content
Wait for content to load:
json
{
"action": "wait",
"selector": ".loading",
"state": "hidden"
}
4. Set Appropriate Timeouts
Adjust for slow pages:
json
{
"action": "wait",
"selector": "#data-table",
"timeout": 30000
}
5. Validate Results
Check action results:
json
{
"action": "extract",
"selector": ".success-message",
"validate": true
}
API Reference
See Browser API for complete endpoint documentation.
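For callers working from Python rather than curl, the execute request shown under Via API could be built roughly as follows. This is a sketch, not an official client: the helper names `build_execute_request` and `chain` are hypothetical, the endpoint path and the `actions` envelope are copied from the examples on this page, and any authentication the API requires is not covered here.

```python
import json
import urllib.request


def build_execute_request(base_url, agent_id, session_id, payload):
    """Build (but do not send) the POST request for the execute endpoint."""
    url = f"{base_url}/api/agents/{agent_id}/browser/{session_id}/execute"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chain(*actions):
    """Wrap several actions in the {"actions": [...]} envelope."""
    return {"actions": list(actions)}
```

Passing the returned request to `urllib.request.urlopen` would send it; keeping construction separate makes the payload easy to inspect or log first.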
