Browser Actions

Browser actions are the building blocks for web automation.

Overview

Available actions:

  • Navigation
  • Click interactions
  • Text input
  • Content extraction
  • Screenshots
  • Scrolling
  • Waiting

Accessibility Tree Snapshots

Instead of relying on screenshots as the primary page representation, the browser now uses text-based accessibility tree snapshots. This approach is inspired by OpenClaw and provides a structured, token-efficient view of the page that the agent can reason about directly.

How It Works

When a snapshot is taken, the browser injects JavaScript that walks the DOM, assigns data-ref="N" attributes to interactive elements, and returns a structured text representation of the page.

The format uses two styles:

  • Interactive elements: [N] role "label" attrs -- where N is the ref number used for targeting
  • Landmarks and headings: - role "label" -- structural elements that provide context

Example snapshot output:

- heading "Search Results"
[1] link "Home"
[2] textbox "Search..." focused
[3] button "Search"
- main "Results"
  [4] link "First Result - Example Page"
  [5] link "Second Result - Another Page"
[6] button "Next Page"

Page text:
First Result - Example Page
A description of the first result found...
Second Result - Another Page
More details about the second result...

The snapshot also includes a page text content section at the bottom, which captures readable text from the page (search results, articles, etc.).
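
The interactive-element format above is regular enough to parse on the client side. The following is a minimal sketch, assuming the `[N] role "label" attrs` layout shown in the example and a literal `Page text:` marker; the function name `parse_refs` and the returned dictionary shape are hypothetical and not part of the product.

```python
import re

# Matches interactive lines such as: [2] textbox "Search..." focused
INTERACTIVE = re.compile(r'^\s*\[(\d+)\]\s+(\S+)\s+"(.*)"\s*(.*)$')

def parse_refs(snapshot: str) -> dict[int, dict]:
    """Return {ref: {"role", "label", "attrs"}} for interactive lines only."""
    refs = {}
    for line in snapshot.splitlines():
        if line.strip() == "Page text:":
            break  # everything after this marker is plain page text
        match = INTERACTIVE.match(line)
        if match:
            ref, role, label, attrs = match.groups()
            refs[int(ref)] = {"role": role, "label": label, "attrs": attrs.split()}
    return refs

SNAPSHOT = '''- heading "Search Results"
[1] link "Home"
[2] textbox "Search..." focused
[3] button "Search"
- main "Results"
  [4] link "First Result - Example Page"
[6] button "Next Page"

Page text:
First Result - Example Page
'''

refs = parse_refs(SNAPSHOT)
print(refs[2])
```

Landmark and heading lines (those starting with `-`) are skipped here because they carry no ref number.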

Getting a Snapshot

Use the snapshot action to get or refresh the accessibility tree:

json
{
  "action": "snapshot"
}

All other actions (click, type, press, scroll) automatically return a fresh snapshot in their response, so you always have an up-to-date view of the page after every interaction.

Ref-Based Targeting

Instead of writing CSS selectors, you can target elements by their ref number from the snapshot. The ref number resolves to a [data-ref="N"] CSS selector internally.

json
{
  "action": "click",
  "ref": 5
}

This is equivalent to { "action": "click", "selector": "[data-ref='5']" } but is shorter and less error-prone. Ref numbers are reassigned on each snapshot, so always use refs from the most recent snapshot.

You can use ref-based targeting with click, type, and other element-targeting actions:

json
{
  "action": "type",
  "ref": 2,
  "text": "search query"
}
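
The ref-to-selector equivalence described above can be sketched as a small helper. This is an illustration only: `resolve_target` is a hypothetical name, and the real resolution happens inside the browser service.

```python
def resolve_target(action: dict) -> dict:
    """Return a copy of the action with "ref" rewritten to a data-ref selector."""
    if "ref" not in action:
        return dict(action)  # already selector-based, nothing to rewrite
    resolved = {key: value for key, value in action.items() if key != "ref"}
    resolved["selector"] = f"[data-ref='{int(action['ref'])}']"
    return resolved

print(resolve_target({"action": "click", "ref": 5}))
print(resolve_target({"action": "type", "ref": 2, "text": "search query"}))
```

Because refs are reassigned on every snapshot, a helper like this is only meaningful when applied to refs taken from the most recent snapshot.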

Token Efficiency

Accessibility tree snapshots are typically far smaller than screenshots:

Representation      Typical Size
Screenshot (PNG)    ~5 MB
Snapshot (text)     ~5-50 KB
Reduction           ~100x or more

This makes it practical to include page state in every tool response without blowing up context windows.
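
As a quick check of the figures above, the reduction factor follows directly from the typical sizes. The sketch below uses decimal units (1 KB = 1,000 bytes), which is an assumption; the sizes themselves are the approximate values quoted in this section.

```python
KB = 1_000
MB = 1_000_000

def reduction(screenshot_bytes: int, snapshot_bytes: int) -> float:
    """How many times smaller the snapshot is than the screenshot."""
    return screenshot_bytes / snapshot_bytes

for size in (5 * KB, 50 * KB):
    print(f"{size // KB} KB snapshot: {reduction(5 * MB, size):.0f}x smaller")
# prints "5 KB snapshot: 1000x smaller" then "50 KB snapshot: 100x smaller"
```

The quoted ~100x figure therefore corresponds to the largest typical snapshot; smaller snapshots save proportionally more.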

Vision Model Fallback

For pages where the accessibility tree does not capture sufficient information (e.g., canvas-based applications, complex visual layouts, or CAPTCHA challenges), the agent can fall back to screenshot-based analysis using a vision model. The vision model is configurable via settings.llm.visionModel and follows an auto-fallback chain: configured vision model, then the main LLM if it supports vision, then Llama 4 Scout as a last resort.
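
The fallback order can be sketched as a simple function. This is a sketch under stated assumptions: the function name, its parameters, and the `llama-4-scout` identifier string are hypothetical; only the order (configured model, then a vision-capable main LLM, then Llama 4 Scout) comes from the description above.

```python
from typing import Optional

LAST_RESORT = "llama-4-scout"  # placeholder identifier for Llama 4 Scout

def pick_vision_model(configured: Optional[str], main_llm: str,
                      main_supports_vision: bool) -> str:
    """Resolve which model handles screenshot analysis."""
    if configured:               # 1. value of settings.llm.visionModel, if set
        return configured
    if main_supports_vision:     # 2. main LLM, when it accepts images
        return main_llm
    return LAST_RESORT           # 3. final fallback

print(pick_vision_model(None, "main-model", main_supports_vision=False))
```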

Action Types

Navigate

Navigate to a URL:

json
{
  "action": "navigate",
  "url": "https://example.com"
}

Click

Click an element:

json
{
  "action": "click",
  "selector": "#submit-button"
}

Options:

json
{
  "action": "click",
  "selector": ".menu-item",
  "button": "left",
  "click_count": 2,
  "delay": 100
}

Type

Enter text:

json
{
  "action": "type",
  "selector": "#email-input",
  "text": "user@example.com"
}

With options:

json
{
  "action": "type",
  "selector": "#search",
  "text": "search query",
  "delay": 50,
  "clear": true
}

Press

Press keyboard keys:

json
{
  "action": "press",
  "key": "Enter"
}

Common keys:

  • Enter, Tab, Escape
  • ArrowUp, ArrowDown, ArrowLeft, ArrowRight
  • Backspace, Delete
  • Modifiers: Shift+Enter, Control+a
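
Modifier combinations such as `Control+a` follow a `Modifier+Key` pattern. The sketch below shows one way a client might validate a combination before sending it. The helper name and the modifier set are assumptions: `Alt` and `Meta` are not listed above and are included only for illustration.

```python
MODIFIERS = {"Shift", "Control", "Alt", "Meta"}  # Alt and Meta are assumed

def parse_key(combo: str) -> tuple[list[str], str]:
    """Split "Control+a" into (["Control"], "a"); plain keys have no modifiers."""
    *modifiers, key = combo.split("+")
    unknown = [m for m in modifiers if m not in MODIFIERS]
    if unknown or not key:
        raise ValueError(f"invalid key combination: {combo!r}")
    return modifiers, key

print(parse_key("Control+a"))
print(parse_key("Enter"))
```

This simple split does not handle a literal `+` key, which would need its own spelling.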

Extract

Extract content from the page:

json
{
  "action": "extract",
  "selector": ".product-price",
  "attribute": "text"
}

Extract an attribute value:

json
{
  "action": "extract",
  "selector": "img.hero",
  "attribute": "src"
}

Multiple elements:

json
{
  "action": "extract",
  "selector": ".item-title",
  "multiple": true
}

Screenshot

Capture screenshot:

json
{
  "action": "screenshot"
}

With options:

json
{
  "action": "screenshot",
  "selector": "#main-content",
  "full_page": false,
  "format": "png"
}

Scroll

Scroll the page:

json
{
  "action": "scroll",
  "direction": "down",
  "amount": 500
}

Scroll to element:

json
{
  "action": "scroll",
  "selector": "#footer",
  "behavior": "smooth"
}

Wait

Wait for conditions:

json
{
  "action": "wait",
  "selector": ".loading",
  "state": "hidden"
}

Wait for an element:

json
{
  "action": "wait",
  "selector": "#content",
  "state": "visible",
  "timeout": 10000
}

Wait for a fixed time:

json
{
  "action": "wait",
  "time": 2000
}

Wait for navigation:

json
{
  "action": "wait",
  "type": "navigation"
}
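
Conceptually, a wait with a timeout is a polling loop with a deadline. The sketch below illustrates that idea only; it is not the service's implementation. The function name, the polling interval, and the reading of `timeout` as milliseconds are all assumptions.

```python
import time
from typing import Callable

def wait_for(condition: Callable[[], bool], timeout_ms: int = 10000,
             interval_ms: int = 100) -> bool:
    """Poll condition() until it is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False  # caller decides whether this is a Timeout error
        time.sleep(interval_ms / 1000)

# Usage: a condition that becomes true on the third check.
checks = {"count": 0}
def ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_for(ready, timeout_ms=1000, interval_ms=10))
```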

Executing Actions

Via API

bash
curl -X POST https://your-domain.com/api/agents/{id}/browser/{sessionId}/execute \
  -H "Content-Type: application/json" \
  -d '{
    "action": "click",
    "selector": "#login-button"
  }'

Response:

json
{
  "success": true,
  "result": null,
  "duration_ms": 150
}
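
The same request can be built from Python using only the standard library. This is a minimal sketch: the helper name and the `agent-123` / `session-456` identifiers are placeholders, and any authentication the API requires is not covered in this section, so none is shown.

```python
import json
import urllib.request

def build_execute_request(base_url: str, agent_id: str, session_id: str,
                          action: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the execute endpoint."""
    url = f"{base_url}/api/agents/{agent_id}/browser/{session_id}/execute"
    return urllib.request.Request(
        url,
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_execute_request(
    "https://your-domain.com", "agent-123", "session-456",
    {"action": "click", "selector": "#login-button"},
)
print(req.get_method(), req.full_url)

# Sending is left commented out because it needs a live session:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["success"], body["duration_ms"])
```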

Via Workflow

json
{
  "nodes": [
    {
      "id": "nav-1",
      "type": "navigate",
      "data": { "url": "https://example.com" }
    },
    {
      "id": "click-1",
      "type": "click",
      "data": { "selector": "#login" }
    },
    {
      "id": "type-1",
      "type": "type",
      "data": {
        "selector": "#email",
        "text": "{{credentials.email}}"
      }
    }
  ]
}
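
Workflow nodes carry the same information as individual actions: `type` names the action and `data` holds its parameters. Assuming that correspondence holds for every node type (it is not confirmed here), a workflow could be flattened into an action list as follows. `nodes_to_actions` is a hypothetical helper, and template placeholders such as `{{credentials.email}}` are passed through untouched.

```python
def nodes_to_actions(workflow: dict) -> list[dict]:
    """Flatten workflow nodes into action payloads, preserving node order."""
    return [{"action": node["type"], **node["data"]} for node in workflow["nodes"]]

workflow = {
    "nodes": [
        {"id": "nav-1", "type": "navigate", "data": {"url": "https://example.com"}},
        {"id": "click-1", "type": "click", "data": {"selector": "#login"}},
        {"id": "type-1", "type": "type",
         "data": {"selector": "#email", "text": "{{credentials.email}}"}},
    ]
}

for action in nodes_to_actions(workflow):
    print(action)
```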

Selectors

CSS Selectors

json
// By ID
{ "selector": "#my-element" }

// By class
{ "selector": ".my-class" }

// By attribute
{ "selector": "[data-testid='submit']" }

// By tag
{ "selector": "button" }

// Combined
{ "selector": "form#login input[type='email']" }

XPath Selectors

json
{
  "selector": "//button[contains(text(), 'Submit')]",
  "selector_type": "xpath"
}

Text Selectors

json
{
  "selector": "text=Click here",
  "selector_type": "text"
}
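
When selectors come from mixed sources, a client may want to set `selector_type` automatically. The sketch below uses a simple prefix heuristic. The heuristic and the helper name are assumptions, as is treating CSS as the default when `selector_type` is omitted (the CSS examples above omit it).

```python
def make_selector(selector: str) -> dict:
    """Attach selector_type for XPath and text selectors; leave CSS as-is."""
    if selector.startswith(("//", "(//")):
        return {"selector": selector, "selector_type": "xpath"}
    if selector.startswith("text="):
        return {"selector": selector, "selector_type": "text"}
    return {"selector": selector}  # assumed to be interpreted as CSS

print(make_selector("//button[contains(text(), 'Submit')]"))
print(make_selector("text=Click here"))
print(make_selector("form#login input[type='email']"))
```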

Action Chaining

Execute multiple actions:

bash
curl -X POST .../browser/{sessionId}/execute \
  -H "Content-Type: application/json" \
  -d '{
    "actions": [
      { "action": "navigate", "url": "https://example.com" },
      { "action": "wait", "selector": "#content" },
      { "action": "click", "selector": "#menu" },
      { "action": "screenshot" }
    ]
  }'

Error Handling

Common Errors

Error              Cause                    Solution
ElementNotFound    Selector doesn't match   Verify selector, wait for element
Timeout            Action took too long     Increase timeout
NavigationFailed   Page failed to load      Check URL, retry
ClickIntercepted   Element blocked          Wait, scroll into view

Retry Logic

json
{
  "action": "click",
  "selector": "#button",
  "retry": {
    "attempts": 3,
    "delay": 1000
  }
}
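
The retry options map onto a familiar loop. The sketch below shows the likely semantics on the client side. It assumes that `attempts` counts total tries and that `delay` is in milliseconds, neither of which is stated explicitly; `with_retry` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retry(run: Callable[[], T], attempts: int = 3, delay_ms: int = 1000,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call run() up to `attempts` times, pausing delay_ms between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_ms / 1000)
    raise AssertionError("unreachable")

# Usage: an action that fails twice, then succeeds.
calls = {"count": 0}
def flaky_click() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("ElementNotFound")
    return "clicked"

print(with_retry(flaky_click, attempts=3, delay_ms=10))
```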

Advanced Actions

Hover

json
{
  "action": "hover",
  "selector": ".dropdown-trigger"
}

Select

Select dropdown option:

json
{
  "action": "select",
  "selector": "#country",
  "value": "US"
}

Upload

Upload file:

json
{
  "action": "upload",
  "selector": "input[type='file']",
  "file_path": "/path/to/file.pdf"
}

Evaluate

Execute JavaScript:

json
{
  "action": "evaluate",
  "script": "document.querySelector('#count').textContent"
}

Best Practices

1. Use Stable Selectors

Prefer:

  • IDs and data attributes
  • Semantic selectors

Avoid:

  • Dynamic classes
  • Position-based selectors

2. Wait for Elements

Always wait before interacting:

json
[
  { "action": "wait", "selector": "#form" },
  { "action": "type", "selector": "#email", "text": "..." }
]
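
This practice can be applied mechanically when building an action chain. The sketch below inserts a wait before every selector-based action. It is one possible approach rather than a built-in feature: `with_waits` is a hypothetical helper, and it waits on the target element itself instead of a containing element such as `#form`.

```python
def with_waits(actions: list[dict]) -> list[dict]:
    """Insert a visibility wait before each action that targets a selector."""
    result = []
    for action in actions:
        selector = action.get("selector")
        if selector and action["action"] != "wait":
            result.append({"action": "wait", "selector": selector, "state": "visible"})
        result.append(action)
    return result

chain = with_waits([
    {"action": "navigate", "url": "https://example.com"},
    {"action": "type", "selector": "#email", "text": "..."},
])
for step in chain:
    print(step)
```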

3. Handle Dynamic Content

Wait for content to load:

json
{
  "action": "wait",
  "selector": ".loading",
  "state": "hidden"
}

4. Set Appropriate Timeouts

Adjust for slow pages:

json
{
  "action": "wait",
  "selector": "#data-table",
  "timeout": 30000
}

5. Validate Results

Check action results:

json
{
  "action": "extract",
  "selector": ".success-message",
  "validate": true
}

API Reference

See Browser API for complete endpoint documentation.