Browser Actions

Browser actions are the building blocks for web automation.

Overview

Available actions:

Navigation
Click interactions
Text input
Content extraction
Screenshots
Scrolling
Waiting

Accessibility Tree Snapshots

Instead of relying on screenshots as the primary page representation, the browser now uses text-based accessibility tree snapshots. This approach is inspired by OpenClaw and provides a structured, token-efficient view of the page that the agent can reason about directly.

How It Works

When a snapshot is taken, the browser injects JavaScript that walks the DOM, assigns data-ref="N" attributes to interactive elements, and returns a structured text representation of the page.

The format uses two styles:

Interactive elements: [N] role "label" attrs -- where N is the ref number used for targeting
Landmarks and headings: - role "label" -- structural elements that provide context

Example snapshot output:

- heading "Search Results"
[1] link "Home"
[2] textbox "Search..." focused
[3] button "Search"
- main "Results"
  [4] link "First Result - Example Page"
  [5] link "Second Result - Another Page"
[6] button "Next Page"

Page text:
First Result - Example Page
A description of the first result found...
Second Result - Another Page
More details about the second result...

The snapshot ends with a "Page text:" section that captures readable text from the page, such as search results or article content.
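As an illustration of how a client might consume this format, the following sketch parses the interactive lines of a snapshot into a lookup table keyed by ref. It is not part of the browser itself: the `SnapshotElement` type and `parse_interactive_elements` function are hypothetical names, and the regular expression assumes that labels contain no embedded double quotes.

```python
import re
from typing import NamedTuple


class SnapshotElement(NamedTuple):
    ref: int
    role: str
    label: str
    attrs: list[str]


# Matches interactive lines such as: [2] textbox "Search..." focused
# Assumption: labels never contain an embedded double quote.
INTERACTIVE_LINE = re.compile(r'^\s*\[(\d+)\]\s+(\S+)\s+"([^"]*)"\s*(.*)$')


def parse_interactive_elements(snapshot: str) -> dict[int, SnapshotElement]:
    """Return interactive elements keyed by ref, ignoring the page text section."""
    elements: dict[int, SnapshotElement] = {}
    for line in snapshot.splitlines():
        if line.strip() == "Page text:":
            break  # everything after this marker is plain page text
        match = INTERACTIVE_LINE.match(line)
        if match:
            ref, role, label, attrs = match.groups()
            elements[int(ref)] = SnapshotElement(int(ref), role, label, attrs.split())
    return elements


snapshot = '''- heading "Search Results"
[1] link "Home"
[2] textbox "Search..." focused
[3] button "Search"
- main "Results"
  [4] link "First Result - Example Page"

Page text:
First Result - Example Page
'''

elements = parse_interactive_elements(snapshot)
print(elements[2].role, elements[2].attrs)
```

Landmark and heading lines (those starting with `-`) are skipped here because they carry no ref and cannot be targeted.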

Getting a Snapshot

Use the snapshot action to get or refresh the accessibility tree:

json

{
  "action": "snapshot"
}

All other actions (click, type, press, scroll) automatically return a fresh snapshot in their response, so you always have an up-to-date view of the page after every interaction.

Ref-Based Targeting

Instead of writing CSS selectors, you can target elements by their ref number from the snapshot. The ref number resolves to a [data-ref="N"] CSS selector internally.

json

{
  "action": "click",
  "ref": 5
}

This is equivalent to { "action": "click", "selector": "[data-ref='5']" } but is shorter and less error-prone. Ref numbers are reassigned on each snapshot, so always use refs from the most recent snapshot.

You can use ref-based targeting with click, type, and other element-targeting actions:

json

{
  "action": "type",
  "ref": 2,
  "text": "search query"
}
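The ref-to-selector mapping described above can be sketched in a few lines. This is an illustration of the documented equivalence, not the browser's actual code; `resolve_target` is a hypothetical helper name.

```python
def resolve_target(action: dict) -> dict:
    """Return a copy of the action with any "ref" rewritten to a data-ref selector."""
    if "ref" not in action:
        return dict(action)
    # Drop the ref and substitute the equivalent attribute selector.
    resolved = {key: value for key, value in action.items() if key != "ref"}
    resolved["selector"] = f"[data-ref='{int(action['ref'])}']"
    return resolved


print(resolve_target({"action": "click", "ref": 5}))
# {'action': 'click', 'selector': "[data-ref='5']"}
```

Because refs are reassigned on each snapshot, a resolved selector is only meaningful until the next snapshot is taken.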

Token Efficiency

Accessibility tree snapshots are dramatically more efficient than screenshots:

Representation	Typical Size
Screenshot (PNG)	~5 MB
Snapshot (text)	~5-50 KB
Reduction	~100x to 1,000x

This makes it practical to include page state in every tool response without blowing up context windows.
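The arithmetic behind the reduction figure can be checked directly from the sizes in the table. The sketch below treats the typical sizes as exact values in decimal units, which is a simplification; the figures measure bytes, so the saving in tokens depends on how a given model encodes images and text.

```python
SCREENSHOT_BYTES = 5 * 1000 * 1000  # ~5 MB screenshot
SNAPSHOT_MIN_BYTES = 5 * 1000       # ~5 KB snapshot (small page)
SNAPSHOT_MAX_BYTES = 50 * 1000      # ~50 KB snapshot (large page)

# The largest snapshots give the smallest reduction, and vice versa.
smallest_reduction = SCREENSHOT_BYTES // SNAPSHOT_MAX_BYTES
largest_reduction = SCREENSHOT_BYTES // SNAPSHOT_MIN_BYTES

print(f"{smallest_reduction}x to {largest_reduction}x")  # prints "100x to 1000x"
```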

Vision Model Fallback

For pages where the accessibility tree does not capture sufficient information (e.g., canvas-based applications, complex visual layouts, or CAPTCHA challenges), the agent can fall back to screenshot-based analysis using a vision model. The vision model is configurable via settings.llm.visionModel and follows an auto-fallback chain: configured vision model, then the main LLM if it supports vision, then Llama 4 Scout as a last resort.
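The fallback chain can be pictured as a simple priority check. The sketch below is an assumption about the selection order only: the function name, its parameters, and the way vision support is detected are hypothetical, and the real setting is read from settings.llm.visionModel.

```python
from typing import Optional

LAST_RESORT_MODEL = "Llama 4 Scout"


def pick_vision_model(
    configured: Optional[str],
    main_llm: str,
    main_supports_vision: bool,
) -> str:
    """Choose a vision model following the documented fallback order."""
    if configured:              # 1. explicitly configured vision model
        return configured
    if main_supports_vision:    # 2. main LLM, if it can process images
        return main_llm
    return LAST_RESORT_MODEL    # 3. last resort


print(pick_vision_model(None, "main-model", main_supports_vision=False))
# Llama 4 Scout
```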

Action Types

Navigate

Navigate to a URL:

json

{
  "action": "navigate",
  "url": "https://example.com"
}

Click

Click an element:

json

{
  "action": "click",
  "selector": "#submit-button"
}

Options:

json

{
  "action": "click",
  "selector": ".menu-item",
  "button": "left",
  "click_count": 2,
  "delay": 100
}

Type

Enter text:

json

{
  "action": "type",
  "selector": "#email-input",
  "text": "user@example.com"
}

With options:

json

{
  "action": "type",
  "selector": "#search",
  "text": "search query",
  "delay": 50,
  "clear": true
}

Press

Press keyboard keys:

json

{
  "action": "press",
  "key": "Enter"
}

Common keys:

Enter, Tab, Escape
ArrowUp, ArrowDown, ArrowLeft, ArrowRight
Backspace, Delete
Modifiers: Shift+Enter, Control+a

Extract

Extract content from page:

json

{
  "action": "extract",
  "selector": ".product-price",
  "attribute": "text"
}

Extract an attribute value:

json

{
  "action": "extract",
  "selector": "img.hero",
  "attribute": "src"
}

Multiple elements:

json

{
  "action": "extract",
  "selector": ".item-title",
  "multiple": true
}

Screenshot

Capture screenshot:

json

{
  "action": "screenshot"
}

With options:

json

{
  "action": "screenshot",
  "selector": "#main-content",
  "full_page": false,
  "format": "png"
}

Scroll

Scroll the page:

json

{
  "action": "scroll",
  "direction": "down",
  "amount": 500
}

Scroll to element:

json

{
  "action": "scroll",
  "selector": "#footer",
  "behavior": "smooth"
}

Wait

Wait for conditions:

json

{
  "action": "wait",
  "selector": ".loading",
  "state": "hidden"
}

Wait types:

json

// Wait for element
{
  "action": "wait",
  "selector": "#content",
  "state": "visible",
  "timeout": 10000
}

// Wait for time
{
  "action": "wait",
  "time": 2000
}

// Wait for navigation
{
  "action": "wait",
  "type": "navigation"
}
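How the browser implements waiting is not described here, but the semantics of a condition plus a timeout can be illustrated with a client-side polling loop. In this sketch `wait_until`, its `interval_ms` parameter, and the default values are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_ms: int = 10000,
    interval_ms: int = 100,
) -> bool:
    """Poll condition until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False  # timed out without the condition becoming true
        time.sleep(interval_ms / 1000)


# Example: a condition that becomes true on the third check.
checks = {"count": 0}

def ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(ready, timeout_ms=1000, interval_ms=10))  # prints "True"
```

The condition is always checked at least once, so a state that already holds returns immediately even with a zero timeout.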

Executing Actions

Via API

bash

curl -X POST https://your-domain.com/api/agents/{id}/browser/{sessionId}/execute \
  -H "Content-Type: application/json" \
  -d '{
    "action": "click",
    "selector": "#login-button"
  }'

Response:

json

{
  "success": true,
  "result": null,
  "duration_ms": 150
}
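The same request can be assembled from a script. The sketch below builds the request with Python's standard library without sending it; the agent and session identifiers are placeholders, and `build_execute_request` is a hypothetical helper, not part of any official client.

```python
import json
import urllib.request


def build_execute_request(
    base_url: str, agent_id: str, session_id: str, action: dict
) -> urllib.request.Request:
    """Build a POST request for the execute endpoint shown above."""
    url = f"{base_url}/api/agents/{agent_id}/browser/{session_id}/execute"
    return urllib.request.Request(
        url,
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "agent-123" and "session-456" are placeholder identifiers.
request = build_execute_request(
    "https://your-domain.com",
    "agent-123",
    "session-456",
    {"action": "click", "selector": "#login-button"},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response body could then be decoded with `json.load` and its `success` field checked.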

Via Workflow

json

{
  "nodes": [
    {
      "id": "nav-1",
      "type": "navigate",
      "data": { "url": "https://example.com" }
    },
    {
      "id": "click-1",
      "type": "click",
      "data": { "selector": "#login" }
    },
    {
      "id": "type-1",
      "type": "type",
      "data": {
        "selector": "#email",
        "text": "{{credentials.email}}"
      }
    }
  ]
}
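Each workflow node appears to pair an action type with that action's parameters. Assuming that nodes run in the order listed and that a node's `type` corresponds to the `action` field used elsewhere on this page (neither is confirmed here), a workflow can be flattened into a list of actions as follows; `workflow_to_actions` is a hypothetical helper.

```python
def workflow_to_actions(workflow: dict) -> list[dict]:
    """Flatten workflow nodes into action objects, preserving node order."""
    return [
        {"action": node["type"], **node.get("data", {})}
        for node in workflow["nodes"]
    ]


workflow = {
    "nodes": [
        {"id": "nav-1", "type": "navigate", "data": {"url": "https://example.com"}},
        {"id": "click-1", "type": "click", "data": {"selector": "#login"}},
    ]
}

for action in workflow_to_actions(workflow):
    print(action)
```

Template placeholders such as `{{credentials.email}}` are passed through untouched in this sketch; resolving them is left to the workflow engine.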

Selectors

CSS Selectors

json

// By ID
{ "selector": "#my-element" }

// By class
{ "selector": ".my-class" }

// By attribute
{ "selector": "[data-testid='submit']" }

// By tag
{ "selector": "button" }

// Combined
{ "selector": "form#login input[type='email']" }

XPath Selectors

json

{
  "selector": "//button[contains(text(), 'Submit')]",
  "selector_type": "xpath"
}

Text Selectors

json

{
  "selector": "text=Click here",
  "selector_type": "text"
}
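When actions are generated programmatically, a small helper can fill in `selector_type` from the shape of the selector. The inference rules below (a leading `//` means XPath, a leading `text=` means a text selector) are an assumed convention for illustration, not documented API behavior.

```python
def selector_fields(selector: str) -> dict:
    """Return selector fields, inferring selector_type from the selector's prefix."""
    if selector.startswith("//"):
        return {"selector": selector, "selector_type": "xpath"}
    if selector.startswith("text="):
        return {"selector": selector, "selector_type": "text"}
    # CSS is treated as the default, so no selector_type is added.
    return {"selector": selector}


action = {"action": "click", **selector_fields("//button[contains(text(), 'Submit')]")}
print(action["selector_type"])  # prints "xpath"
```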

Action Chaining

Execute multiple actions:

bash

curl -X POST .../browser/{sessionId}/execute \
  -H "Content-Type: application/json" \
  -d '{
    "actions": [
      { "action": "navigate", "url": "https://example.com" },
      { "action": "wait", "selector": "#content" },
      { "action": "click", "selector": "#menu" },
      { "action": "screenshot" }
    ]
  }'

Error Handling

Common Errors

Error	Cause	Solution
`ElementNotFound`	Selector doesn't match	Verify selector, wait for element
`Timeout`	Action took too long	Increase timeout
`NavigationFailed`	Page failed to load	Check URL, retry
`ClickIntercepted`	Element blocked	Wait, scroll into view

Retry Logic

json

{
  "action": "click",
  "selector": "#button",
  "retry": {
    "attempts": 3,
    "delay": 1000
  }
}
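The `attempts` and `delay` fields presumably mean "try up to N times, pausing between tries". A client-side equivalent might look like the sketch below; `run_with_retry` is a hypothetical helper, and treating `delay` as milliseconds is an assumption based on the other timing values on this page.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_ms / 1000)
    raise AssertionError("unreachable")


# Example: an operation that fails twice, then succeeds.
calls = {"count": 0}

def flaky_click() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("ElementNotFound")
    return "clicked"

pauses: list[float] = []
print(run_with_retry(flaky_click, attempts=3, delay_ms=1000, sleep=pauses.append))
# clicked
```

The `sleep` parameter is injectable so that the example runs instantly; in real use the default `time.sleep` applies.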

Advanced Actions

Hover

json

{
  "action": "hover",
  "selector": ".dropdown-trigger"
}

Select

Select dropdown option:

json

{
  "action": "select",
  "selector": "#country",
  "value": "US"
}

Upload

Upload file:

json

{
  "action": "upload",
  "selector": "input[type='file']",
  "file_path": "/path/to/file.pdf"
}

Evaluate

Execute JavaScript:

json

{
  "action": "evaluate",
  "script": "document.querySelector('#count').textContent"
}

Best Practices

1. Use Stable Selectors

Prefer:

IDs and data attributes
Semantic selectors

Avoid:

Dynamic classes
Position-based selectors

2. Wait for Elements

Always wait before interacting:

json

[
  { "action": "wait", "selector": "#form" },
  { "action": "type", "selector": "#email", "text": "..." }
]

3. Handle Dynamic Content

Wait for content to load:

json

{
  "action": "wait",
  "selector": ".loading",
  "state": "hidden"
}

4. Set Appropriate Timeouts

Adjust for slow pages:

json

{
  "action": "wait",
  "selector": "#data-table",
  "timeout": 30000
}

5. Validate Results

Check action results:

json

{
  "action": "extract",
  "selector": ".success-message",
  "validate": true
}

API Reference

See Browser API for complete endpoint documentation.
