Data Sources

Data sources provide knowledge for your AI agents through Retrieval-Augmented Generation (RAG).

Overview

Source types:

  • File - Documents, PDFs, text files
  • Website Crawl - Crawl entire websites with BFS traversal and delta detection
  • Database - External databases
  • API - REST/GraphQL endpoints

Adding Sources

Via Dashboard

  1. Navigate to Context
  2. Click Add Source
  3. Select source type
  4. Configure and upload

Via API

bash
# File upload
curl -X POST https://your-domain.com/api/agents/{id}/sources \
  -F "type=file" \
  -F "name=Product Manual" \
  -F "file=@/path/to/manual.pdf"

# Crawl website
curl -X POST https://your-domain.com/api/agents/{id}/sources/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "name": "Documentation",
    "maxPages": 50,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"]
  }'

Source Types

File

Supported formats:

  • PDF (.pdf)
  • Text (.txt)
  • Markdown (.md)
  • Word (.docx)
  • JSON (.json)
  • CSV (.csv)

json
{
  "type": "file",
  "name": "User Guide",
  "file": "<binary>"
}

Crawl Website

Crawl an entire website with automatic page discovery and delta detection. This replaces the old single-page URL source type.

json
{
  "url": "https://help.example.com",
  "name": "Help Center",
  "maxPages": 30,
  "maxDepth": 3,
  "includePatterns": ["/docs/*", "/guides/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"]
}

Parameters:

Parameter       | Type     | Default    | Description
----------------|----------|------------|--------------------------------------------
url             | string   | (required) | Starting URL for the crawl
name            | string   | (required) | Display name for the source
maxPages        | number   | 50         | Maximum pages to crawl (up to 50)
maxDepth        | number   | 3          | Maximum link depth from start URL (up to 5)
includePatterns | string[] | []         | Glob patterns — only crawl matching URLs
excludePatterns | string[] | []         | Glob patterns — skip matching URLs

See Website Crawling below for full details on how crawling works.

Database

Connect to databases:

json
{
  "type": "database",
  "name": "Product Data",
  "config": {
    "connection_string": "postgresql://...",
    "query": "SELECT * FROM products"
  }
}

API

Fetch from APIs:

json
{
  "type": "api",
  "name": "CRM Contacts",
  "config": {
    "url": "https://api.crm.com/contacts",
    "method": "GET",
    "headers": {
      "Authorization": "Bearer {{api_key}}"
    },
    "schedule": "0 * * * *"
  }
}

Website Crawling

Website crawling lets your agent ingest entire sites as knowledge sources. It uses Cloudflare Browser Rendering to fetch pages and follows links to discover content automatically.

How It Works

The crawler uses breadth-first search (BFS) traversal powered by two Cloudflare Browser Rendering REST endpoints:

  1. /markdown — Renders each page and extracts its content as clean Markdown
  2. /links — Extracts all links from the page for discovery of new URLs

Starting from the provided URL, the crawler visits each page, converts it to Markdown for indexing, then follows discovered links up to the configured maxDepth and maxPages limits. The crawl runs in the background via ctx.waitUntil, so it does not block the API response or the agent conversation.

Note: The crawler uses the synchronous /markdown and /links endpoints, not the async /crawl endpoint, which is unreliable.
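
The BFS traversal described above can be sketched as follows. This is a minimal sketch, not the platform's implementation: fetch_markdown and fetch_links stand in for calls to the /markdown and /links endpoints.

```python
from collections import deque

def crawl(start_url, fetch_markdown, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl: visit pages level by level, honoring both limits."""
    pages = {}                       # url -> extracted Markdown content
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link depth from the start URL)
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages[url] = fetch_markdown(url)   # render the page as Markdown for indexing
        if depth < max_depth:
            for link in fetch_links(url):  # discover new URLs to visit
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Because the queue is FIFO, pages closer to the start URL are always indexed before deeper ones, so the maxPages cap trims the deepest pages first.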

Configuration

When creating a crawl source, you can control scope with these parameters:

  • maxPages (up to 50) — Hard cap on total pages crawled. Prevents runaway crawls on large sites.
  • maxDepth (up to 5) — How many link hops from the starting URL. Depth 1 crawls the start URL plus the pages it links to directly.
  • includePatterns — Glob patterns that URLs must match to be crawled. Useful for restricting to specific sections (e.g., ["/docs/*"]).
  • excludePatterns — Glob patterns for URLs to skip. Useful for excluding blogs, changelogs, or auth pages.
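
As a rough sketch, include/exclude filtering might behave like this. The helper below is an illustration using Python's fnmatch globbing; the platform's actual matching semantics may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include_patterns=None, exclude_patterns=None):
    """Apply include/exclude globs to the URL path; exclusion wins over inclusion."""
    path = urlparse(url).path
    if any(fnmatch(path, p) for p in (exclude_patterns or [])):
        return False
    if include_patterns:  # when includes are set, the path must match one of them
        return any(fnmatch(path, p) for p in include_patterns)
    return True           # no patterns configured: crawl everything
```

Checking excludes first means a URL matching both pattern lists is skipped, which is the safer default for auth or admin pages.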

Delta Detection

The crawler computes a SHA-256 hash of each page's Markdown content. On subsequent crawls (re-crawl or reindex), pages whose hash matches the previous crawl are skipped entirely. Only changed or new pages are re-processed and re-embedded.

This makes re-crawling efficient: if a 40-page site has 3 pages that changed, only those 3 pages are re-indexed.
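
The hash comparison can be sketched in a few lines. This is an illustrative sketch of the idea, not the platform's code; the function and its return shape are assumptions.

```python
import hashlib

def detect_deltas(previous_hashes, crawled_pages):
    """Compare each page's SHA-256 content hash to the last crawl."""
    changed, unchanged = {}, []
    for url, markdown in crawled_pages.items():
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if previous_hashes.get(url) == digest:
            unchanged.append(url)   # identical content: skip re-embedding
        else:
            changed[url] = digest   # new or modified: re-process this page
    return changed, unchanged
```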

Re-Crawling

Triggering a reindex on a crawl source performs a full re-crawl with delta detection:

bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex

The re-crawl follows the same BFS traversal and respects the original maxPages, maxDepth, and pattern settings. Pages are compared by hash, and only changed/new pages are updated in the vector index.

LLM Tools

The agent has two built-in tools for crawling during conversations:

  • web_crawl — Initiates a website crawl with the same parameters (url, maxPages, maxDepth, includePatterns, excludePatterns). The crawl runs in the background and creates a new source.
  • get_crawl_status — Checks the progress of an active crawl, returning page count, status, and any errors.

Example agent interaction:

User: "Crawl our help center at help.example.com and learn about our products"

Agent: [Calls web_crawl with url="https://help.example.com", maxPages=30]
       "I've started crawling your help center. Let me check the progress..."
       [Calls get_crawl_status]
       "The crawl found 24 pages and is now indexing them. I'll be able to
        answer questions about your products once it finishes."
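
A client driving these tools would typically poll get_crawl_status until the crawl leaves the crawling state. A minimal sketch of that loop, where the get_crawl_status callable and its response shape are assumptions rather than the platform's actual tool signature:

```python
import time

def wait_for_crawl(get_crawl_status, source_id, poll_seconds=5, max_polls=60):
    """Poll a crawl until it leaves the 'crawling' state (see Source Status below)."""
    for _ in range(max_polls):
        status = get_crawl_status(source_id)  # e.g. {"status": "crawling", "pageCount": 12}
        if status["status"] != "crawling":
            return status                     # 'ready' or 'error'
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl for source {source_id} still running after {max_polls} polls")
```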

Crawl Metadata

The dashboard displays crawl-specific metadata for website sources:

  • Page count — Total number of pages discovered and indexed
  • Last crawled at — Timestamp of the most recent crawl
  • Changed pages — Number of pages that were new or modified on the last re-crawl
  • Unchanged pages — Number of pages skipped due to matching content hash

Processing Pipeline

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Source    │────▶│   Extract    │────▶│    Chunk     │
│    Input     │     │   Content    │     │    Text      │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Store in   │◀────│   Generate   │◀────│   Clean &    │
│  Vectorize   │     │  Embeddings  │     │   Normalize  │
└──────────────┘     └──────────────┘     └──────────────┘

Source Status

Status     | Description
-----------|---------------------------
pending    | Waiting to process
processing | Currently processing
crawling   | Website crawl in progress
ready      | Available for search
error      | Processing failed
updating   | Refreshing content

Chunking Strategy

Content is split into searchable chunks:

json
{
  "chunking": {
    "method": "semantic",
    "max_size": 1000,
    "overlap": 200
  }
}

Methods:

  • semantic - Smart paragraph splitting
  • fixed - Fixed character count
  • sentence - Sentence boundaries
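
The fixed method with overlap can be sketched as follows. This illustrates how max_size and overlap interact for fixed character-count chunking only; the semantic and sentence methods split on content boundaries instead.

```python
def chunk_fixed(text, max_size=1000, overlap=200):
    """Split text into fixed-size chunks, each overlapping the previous by `overlap` chars."""
    assert overlap < max_size, "overlap must be smaller than max_size"
    chunks, step = [], max_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # the final chunk reached the end of the text
    return chunks
```

The overlap ensures a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.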

Managing Sources

List Sources

bash
curl https://your-domain.com/api/agents/{id}/sources

Get Source Details

bash
curl https://your-domain.com/api/agents/{id}/sources/{sourceId}

Delete Source

bash
curl -X DELETE https://your-domain.com/api/agents/{id}/sources/{sourceId}

Reindex Source

bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex

For crawl sources, this triggers a full re-crawl with delta detection (only changed pages are re-indexed).

Storage

R2 Storage

Files are stored in Cloudflare R2:

  • Automatic replication
  • No egress fees
  • Unlimited storage

Vectorize Index

Embeddings stored in Vectorize:

  • Fast similarity search
  • Automatic indexing
  • Scalable to millions

Integration

With Chat

Sources are automatically searched:

User: "What's the return policy?"

Agent: [Searches sources] → [Finds relevant chunks] → [Generates response]

With Workflows

json
{
  "type": "search-sources",
  "data": {
    "query": "{{input.question}}",
    "limit": 5
  }
}

Best Practices

1. Organize Sources

Group related content:

  • Product documentation
  • FAQ and support
  • Policies and terms

2. Keep Content Fresh

Schedule regular updates:

json
{
  "refresh_schedule": "0 0 * * *"
}

3. Scope Crawls with Patterns

Use include/exclude patterns to focus crawls on relevant content:

json
{
  "includePatterns": ["/docs/*", "/help/*"],
  "excludePatterns": ["/blog/*", "/admin/*", "/login*"]
}

4. Optimize Chunk Size

Balance context and precision:

  • Larger chunks: More context
  • Smaller chunks: Higher precision

5. Use Metadata

Add descriptive metadata:

json
{
  "metadata": {
    "category": "support",
    "version": "2.0",
    "language": "en"
  }
}

6. Monitor Quality

Review search results:

  • Check relevance
  • Update stale content
  • Remove duplicates

API Reference

See Sources API for complete endpoint documentation.