Data Sources

Data sources provide knowledge for your AI agents through Retrieval-Augmented Generation (RAG).

Overview

Source types:

  • File - Documents, PDFs, text files
  • Website Crawl - Crawl entire websites with BFS traversal and delta detection
  • Database - External databases
  • API - REST/GraphQL endpoints

Adding Sources

Via Dashboard

  1. Navigate to Context
  2. Click Add Source
  3. Select source type
  4. Configure and upload

Via API

bash
# File upload
curl -X POST https://your-domain.com/api/agents/{id}/sources \
  -F "type=file" \
  -F "name=Product Manual" \
  -F "file=@/path/to/manual.pdf"

# Crawl website
curl -X POST https://your-domain.com/api/agents/{id}/sources/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "name": "Documentation",
    "maxPages": 50,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"]
  }'

Source Types

File

Supported formats:

  • PDF (.pdf)
  • Text (.txt)
  • Markdown (.md)
  • Word (.docx)
  • JSON (.json)
  • CSV (.csv)

json
{
  "type": "file",
  "name": "User Guide",
  "file": "<binary>"
}

Crawl Website

Crawl an entire website with automatic page discovery and delta detection. This replaces the old single-page URL source type.

json
{
  "url": "https://help.example.com",
  "name": "Help Center",
  "maxPages": 30,
  "maxDepth": 3,
  "includePatterns": ["/docs/*", "/guides/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"]
}

Parameters:

Parameter       | Type     | Default    | Description
----------------|----------|------------|--------------------------------------------
url             | string   | (required) | Starting URL for the crawl
name            | string   | (required) | Display name for the source
maxPages        | number   | 50         | Maximum pages to crawl (up to 50)
maxDepth        | number   | 3          | Maximum link depth from start URL (up to 5)
includePatterns | string[] | []         | Glob patterns — only crawl matching URLs
excludePatterns | string[] | []         | Glob patterns — skip matching URLs

See Website Crawling below for full details on how crawling works.

Database

Connect to databases:

json
{
  "type": "database",
  "name": "Product Data",
  "config": {
    "connection_string": "postgresql://...",
    "query": "SELECT * FROM products"
  }
}

API

Fetch from APIs:

json
{
  "type": "api",
  "name": "CRM Contacts",
  "config": {
    "url": "https://api.crm.com/contacts",
    "method": "GET",
    "headers": {
      "Authorization": "Bearer {{api_key}}"
    },
    "schedule": "0 * * * *"
  }
}

Website Crawling

Website crawling lets your agent ingest entire sites as knowledge sources. It uses Cloudflare Browser Rendering to fetch pages and follows links to discover content automatically.

How It Works

The crawler uses breadth-first search (BFS) traversal powered by two Cloudflare Browser Rendering REST endpoints:

  1. /markdown — Renders each page and extracts its content as clean Markdown
  2. /links — Extracts all links from the page for discovery of new URLs

Starting from the provided URL, the crawler visits each page, converts it to Markdown for indexing, then follows discovered links up to the configured maxDepth and maxPages limits. The crawl runs in the background via ctx.waitUntil, so it does not block the API response or the agent conversation.

Note: The crawler uses the synchronous /markdown and /links endpoints, not the async /crawl endpoint, which is unreliable.
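
The BFS traversal described above can be sketched as follows. This is a minimal sketch, not the platform's implementation: fetch_markdown and fetch_links stand in for calls to the /markdown and /links endpoints.

```python
from collections import deque

def crawl(start_url, fetch_markdown, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl: visit pages level by level, honoring both limits."""
    pages = {}                       # url -> extracted Markdown content
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link depth from the start URL)
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages[url] = fetch_markdown(url)   # render the page as Markdown for indexing
        if depth < max_depth:
            for link in fetch_links(url):  # discover new URLs to visit
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Because the queue is FIFO, pages closer to the start URL are always indexed before deeper ones, so the maxPages cap trims the deepest pages first.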

Configuration

When creating a crawl source, you can control scope with these parameters:

  • maxPages (up to 50) — Hard cap on total pages crawled. Prevents runaway crawls on large sites.
  • maxDepth (up to 5) — How many link hops from the starting URL. Depth 1 crawls the start URL plus the pages it links to directly.
  • includePatterns — Glob patterns that URLs must match to be crawled. Useful for restricting to specific sections (e.g., ["/docs/*"]).
  • excludePatterns — Glob patterns for URLs to skip. Useful for excluding blogs, changelogs, or auth pages.
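
As a rough sketch, include/exclude filtering might behave like this. The helper below is an illustration using Python's fnmatch globbing; the platform's actual matching semantics may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include_patterns=None, exclude_patterns=None):
    """Apply include/exclude globs to the URL path; exclusion wins over inclusion."""
    path = urlparse(url).path
    if any(fnmatch(path, p) for p in (exclude_patterns or [])):
        return False
    if include_patterns:  # when includes are set, the path must match one of them
        return any(fnmatch(path, p) for p in include_patterns)
    return True           # no patterns configured: crawl everything
```

Checking excludes first means a URL matching both pattern lists is skipped, which is the safer default for auth or admin pages.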

Delta Detection

The crawler computes a SHA-256 hash of each page's Markdown content. On subsequent crawls (re-crawl or reindex), pages whose hash matches the previous crawl are skipped entirely. Only changed or new pages are re-processed and re-embedded.

This makes re-crawling efficient: if a 40-page site has 3 pages that changed, only those 3 pages are re-indexed.
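
The hash comparison can be sketched in a few lines. This is an illustrative sketch of the idea, not the platform's code; the function and its return shape are assumptions.

```python
import hashlib

def detect_deltas(previous_hashes, crawled_pages):
    """Compare each page's SHA-256 content hash to the last crawl."""
    changed, unchanged = {}, []
    for url, markdown in crawled_pages.items():
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if previous_hashes.get(url) == digest:
            unchanged.append(url)   # identical content: skip re-embedding
        else:
            changed[url] = digest   # new or modified: re-process this page
    return changed, unchanged
```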

Re-Crawling

Triggering a reindex on a crawl source performs a full re-crawl with delta detection:

bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex

The re-crawl follows the same BFS traversal and respects the original maxPages, maxDepth, and pattern settings. Pages are compared by hash, and only changed/new pages are updated in the vector index.

LLM Tools

The agent has two built-in tools for crawling during conversations:

  • web_crawl — Initiates a website crawl with the same parameters (url, maxPages, maxDepth, includePatterns, excludePatterns). The crawl runs in the background and creates a new source.
  • get_crawl_status — Checks the progress of an active crawl, returning page count, status, and any errors.

Example agent interaction:

User: "Crawl our help center at help.example.com and learn about our products"

Agent: [Calls web_crawl with url="https://help.example.com", maxPages=30]
       "I've started crawling your help center. Let me check the progress..."
       [Calls get_crawl_status]
       "The crawl found 24 pages and is now indexing them. I'll be able to
        answer questions about your products once it finishes."
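
A client driving these tools would typically poll get_crawl_status until the crawl leaves the crawling state. A minimal sketch of that loop, where the get_crawl_status callable and its response shape are assumptions rather than the platform's actual tool signature:

```python
import time

def wait_for_crawl(get_crawl_status, source_id, poll_seconds=5, max_polls=60):
    """Poll a crawl until it leaves the 'crawling' state (see Source Status below)."""
    for _ in range(max_polls):
        status = get_crawl_status(source_id)  # e.g. {"status": "crawling", "pageCount": 12}
        if status["status"] != "crawling":
            return status                     # 'ready' or 'error'
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl for source {source_id} still running after {max_polls} polls")
```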

Crawl Metadata

The dashboard displays crawl-specific metadata for website sources:

  • Page count — Total number of pages discovered and indexed
  • Last crawled at — Timestamp of the most recent crawl
  • Changed pages — Number of pages that were new or modified on the last re-crawl
  • Unchanged pages — Number of pages skipped due to matching content hash

Processing Pipeline

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Source    │────▶│   Extract    │────▶│    Chunk     │
│    Input     │     │   Content    │     │    Text      │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Store in   │◀────│   Generate   │◀────│   Clean &    │
│  Vectorize   │     │  Embeddings  │     │   Normalize  │
└──────────────┘     └──────────────┘     └──────────────┘

Source Status

Status     | Description
-----------|---------------------------
pending    | Waiting to process
processing | Currently processing
crawling   | Website crawl in progress
ready      | Available for search
error      | Processing failed
updating   | Refreshing content

Chunking Strategy

Content is split into searchable chunks:

json
{
  "chunking": {
    "method": "semantic",
    "max_size": 1000,
    "overlap": 200
  }
}

Methods:

  • semantic - Smart paragraph splitting
  • fixed - Fixed character count
  • sentence - Sentence boundaries
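
The fixed method with overlap can be sketched as follows. This illustrates how max_size and overlap interact for fixed character-count chunking only; the semantic and sentence methods split on content boundaries instead.

```python
def chunk_fixed(text, max_size=1000, overlap=200):
    """Split text into fixed-size chunks, each overlapping the previous by `overlap` chars."""
    assert overlap < max_size, "overlap must be smaller than max_size"
    chunks, step = [], max_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # the final chunk reached the end of the text
    return chunks
```

The overlap ensures a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.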

Managing Sources

List Sources

bash
curl https://your-domain.com/api/agents/{id}/sources

Get Source Details

bash
curl https://your-domain.com/api/agents/{id}/sources/{sourceId}

Delete Source

bash
curl -X DELETE https://your-domain.com/api/agents/{id}/sources/{sourceId}

Reindex Source

bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex

For crawl sources, this triggers a full re-crawl with delta detection (only changed pages are re-indexed).

Storage

R2 Storage

Files are stored in Cloudflare R2:

  • Automatic replication
  • No egress fees
  • Unlimited storage

Vectorize Index

Embeddings stored in Vectorize:

  • Fast similarity search
  • Automatic indexing
  • Scalable to millions

Integration

With Chat

Sources are automatically searched:

User: "What's the return policy?"

Agent: [Searches sources] → [Finds relevant chunks] → [Generates response]

With Workflows

json
{
  "type": "search-sources",
  "data": {
    "query": "{{input.question}}",
    "limit": 5
  }
}

Best Practices

1. Organize Sources

Group related content:

  • Product documentation
  • FAQ and support
  • Policies and terms

2. Keep Content Fresh

Schedule regular updates:

json
{
  "refresh_schedule": "0 0 * * *"
}

3. Scope Crawls with Patterns

Use include/exclude patterns to focus crawls on relevant content:

json
{
  "includePatterns": ["/docs/*", "/help/*"],
  "excludePatterns": ["/blog/*", "/admin/*", "/login*"]
}

4. Optimize Chunk Size

Balance context and precision:

  • Larger chunks: More context
  • Smaller chunks: Higher precision

5. Use Metadata

Add descriptive metadata:

json
{
  "metadata": {
    "category": "support",
    "version": "2.0",
    "language": "en"
  }
}

6. Monitor Quality

Review search results:

  • Check relevance
  • Update stale content
  • Remove duplicates

API Reference

See Sources API for complete endpoint documentation.