Data Sources
Data sources provide knowledge for your AI agents through RAG (Retrieval Augmented Generation).
Overview
Source types:
- File - Documents, PDFs, text files
- Website Crawl - Crawl entire websites with BFS traversal and delta detection
- Database - External databases
- API - REST/GraphQL endpoints
Adding Sources
Via Dashboard
- Navigate to Context
- Click Add Source
- Select source type
- Configure and upload
Via API
```bash
# File upload
curl -X POST https://your-domain.com/api/agents/{id}/sources \
  -F "type=file" \
  -F "name=Product Manual" \
  -F "file=@/path/to/manual.pdf"

# Crawl website
curl -X POST https://your-domain.com/api/agents/{id}/sources/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "name": "Documentation",
    "maxPages": 50,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"]
  }'
```

Source Types
File
Supported formats:
- PDF (.pdf)
- Text (.txt)
- Markdown (.md)
- Word (.docx)
- JSON (.json)
- CSV (.csv)
```json
{
  "type": "file",
  "name": "User Guide",
  "file": "<binary>"
}
```

Crawl Website
Crawl an entire website with automatic page discovery and delta detection. This replaces the old single-page URL source type.
```json
{
  "url": "https://help.example.com",
  "name": "Help Center",
  "maxPages": 30,
  "maxDepth": 3,
  "includePatterns": ["/docs/*", "/guides/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"]
}
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | (required) | Starting URL for the crawl |
| name | string | (required) | Display name for the source |
| maxPages | number | 50 | Maximum pages to crawl (up to 50) |
| maxDepth | number | 3 | Maximum link depth from start URL (up to 5) |
| includePatterns | string[] | [] | Glob patterns — only crawl matching URLs |
| excludePatterns | string[] | [] | Glob patterns — skip matching URLs |
See Website Crawling below for full details on how crawling works.
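To make the pattern semantics concrete, here is a rough sketch of include/exclude filtering using Python's fnmatch-style globbing. The helper name and the assumption that patterns match against the URL path (not the full URL) are illustrative, not the platform's actual implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a discovered URL is in scope for the crawl.

    Patterns are matched against the URL path, e.g. "/docs/intro".
    An empty include list means "include everything".
    """
    path = urlparse(url).path
    # Exclusions win first, so blogs and auth pages stay out.
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    # With include patterns set, the path must match at least one.
    if include and not any(fnmatch(path, pat) for pat in include):
        return False
    return True
```

Under this reading, a URL matched by both an include and an exclude pattern is skipped.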
Database
Connect to databases:
```json
{
  "type": "database",
  "name": "Product Data",
  "config": {
    "connection_string": "postgresql://...",
    "query": "SELECT * FROM products"
  }
}
```

API
Fetch from APIs:
```json
{
  "type": "api",
  "name": "CRM Contacts",
  "config": {
    "url": "https://api.crm.com/contacts",
    "method": "GET",
    "headers": {
      "Authorization": "Bearer {{api_key}}"
    },
    "schedule": "0 * * * *"
  }
}
```

Website Crawling
Website crawling lets your agent ingest entire sites as knowledge sources. It uses Cloudflare Browser Rendering to fetch pages and follows links to discover content automatically.
How It Works
The crawler uses breadth-first search (BFS) traversal powered by two Cloudflare Browser Rendering REST endpoints:
- /markdown — Renders each page and extracts its content as clean Markdown
- /links — Extracts all links from the page for discovery of new URLs
Starting from the provided URL, the crawler visits each page, converts it to Markdown for indexing, then follows discovered links up to the configured maxDepth and maxPages limits. The crawl runs in the background via ctx.waitUntil, so it does not block the API response or the agent conversation.
Note: The crawler uses the synchronous /markdown and /links endpoints, not the async /crawl endpoint, which is unreliable.
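The traversal described above amounts to a standard BFS with two caps. The sketch below is illustrative rather than the platform's code: get_links stands in for the /links rendering call, and each visited page would also be fetched via /markdown for indexing.

```python
from collections import deque

def bfs_crawl(start_url: str, get_links, max_pages: int = 50, max_depth: int = 3):
    """Breadth-first crawl: visit pages level by level, honoring both caps.

    get_links(url) -> list[str] stands in for the /links endpoint.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)  # here the real crawler renders /markdown
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:  # never enqueue a URL twice
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited
```

With max_depth=1, only pages directly linked from the start URL are visited, matching the depth semantics described under Configuration.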
Configuration
When creating a crawl source, you can control scope with these parameters:
- maxPages (up to 50) — Hard cap on total pages crawled. Prevents runaway crawls on large sites.
- maxDepth (up to 5) — How many link hops from the starting URL. Depth 1 means only pages directly linked from the start URL.
- includePatterns — Glob patterns that URLs must match to be crawled. Useful for restricting to specific sections (e.g., ["/docs/*"]).
- excludePatterns — Glob patterns for URLs to skip. Useful for excluding blogs, changelogs, or auth pages.
Delta Detection
The crawler computes a SHA-256 hash of each page's Markdown content. On subsequent crawls (re-crawl or reindex), pages whose hash matches the previous crawl are skipped entirely. Only changed or new pages are re-processed and re-embedded.
This makes re-crawling efficient: if a 40-page site has 3 pages that changed, only those 3 pages are re-indexed.
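A minimal sketch of that comparison: the SHA-256-over-Markdown fingerprint comes from the text above, while storing hex digests keyed by URL is an assumption about the bookkeeping.

```python
import hashlib

def page_hash(markdown: str) -> str:
    """Content fingerprint used for delta detection."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or changed since the last crawl.

    previous maps URL -> stored hash; current maps URL -> fresh Markdown.
    Only the returned pages need re-chunking and re-embedding.
    """
    return [
        url for url, md in current.items()
        if previous.get(url) != page_hash(md)
    ]
```

Pages whose hash matches are skipped before any chunking or embedding work happens, which is what makes re-crawls cheap.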
Re-Crawling
Triggering a reindex on a crawl source performs a full re-crawl with delta detection:
```bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex
```

The re-crawl follows the same BFS traversal and respects the original maxPages, maxDepth, and pattern settings. Pages are compared by hash, and only changed/new pages are updated in the vector index.
LLM Tools
The agent has two built-in tools for crawling during conversations:
- web_crawl — Initiates a website crawl with the same parameters (url, maxPages, maxDepth, includePatterns, excludePatterns). The crawl runs in the background and creates a new source.
- get_crawl_status — Checks the progress of an active crawl, returning page count, status, and any errors.
Example agent interaction:
```
User: "Crawl our help center at help.example.com and learn about our products"

Agent: [Calls web_crawl with url="https://help.example.com", maxPages=30]
"I've started crawling your help center. Let me check the progress..."

[Calls get_crawl_status]
"The crawl found 24 pages and is now indexing them. I'll be able to
answer questions about your products once it finishes."
```

Crawl Metadata
The dashboard displays crawl-specific metadata for website sources:
- Page count — Total number of pages discovered and indexed
- Last crawled at — Timestamp of the most recent crawl
- Changed pages — Number of pages that were new or modified on the last re-crawl
- Unchanged pages — Number of pages skipped due to matching content hash
Processing Pipeline
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Source    │────▶│   Extract    │────▶│    Chunk     │
│    Input     │     │   Content    │     │     Text     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Store in   │◀────│   Generate   │◀────│   Clean &    │
│  Vectorize   │     │  Embeddings  │     │  Normalize   │
└──────────────┘     └──────────────┘     └──────────────┘
```

Source Status
| Status | Description |
|---|---|
| pending | Waiting to process |
| processing | Currently processing |
| crawling | Website crawl in progress |
| ready | Available for search |
| error | Processing failed |
| updating | Refreshing content |
Chunking Strategy
Content is split into searchable chunks:
```json
{
  "chunking": {
    "method": "semantic",
    "max_size": 1000,
    "overlap": 200
  }
}
```

Methods:
- semantic - Smart paragraph splitting
- fixed - Fixed character count
- sentence - Sentence boundaries
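Of the three methods, fixed is the simplest to sketch. This illustration shows how max_size and overlap interact; it counts characters, whereas a production pipeline might count tokens, and it is not the platform's actual chunker:

```python
def chunk_fixed(text: str, max_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking: each chunk is up to max_size characters, and
    each new chunk restarts `overlap` characters before the previous one
    ended, so content straddling a boundary appears in both chunks."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    step = max_size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + max_size])
        if i + max_size >= len(text):
            break  # final chunk reached the end; no redundant tail slice
        i += step
    return chunks
```

Larger overlap means more duplicated text in the index but less risk of splitting a relevant passage across chunks.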
Managing Sources
List Sources
```bash
curl https://your-domain.com/api/agents/{id}/sources
```

Get Source Details

```bash
curl https://your-domain.com/api/agents/{id}/sources/{sourceId}
```

Delete Source

```bash
curl -X DELETE https://your-domain.com/api/agents/{id}/sources/{sourceId}
```

Reindex Source

```bash
curl -X POST https://your-domain.com/api/agents/{id}/sources/{sourceId}/reindex
```

For crawl sources, this triggers a full re-crawl with delta detection (only changed pages are re-indexed).
Storage
R2 Storage
Files are stored in Cloudflare R2:
- Automatic replication
- No egress fees
- Unlimited storage
Vectorize Index
Embeddings stored in Vectorize:
- Fast similarity search
- Automatic indexing
- Scalable to millions
Integration
With Chat
Sources are automatically searched:
```
User: "What's the return policy?"
Agent: [Searches sources] → [Finds relevant chunks] → [Generates response]
```

With Workflows
```json
{
  "type": "search-sources",
  "data": {
    "query": "{{input.question}}",
    "limit": 5
  }
}
```

Best Practices
1. Organize Sources
Group related content:
- Product documentation
- FAQ and support
- Policies and terms
2. Keep Content Fresh
Schedule regular updates:
```json
{
  "refresh_schedule": "0 0 * * *"
}
```

3. Scope Crawls with Patterns
Use include/exclude patterns to focus crawls on relevant content:
```json
{
  "includePatterns": ["/docs/*", "/help/*"],
  "excludePatterns": ["/blog/*", "/admin/*", "/login*"]
}
```

4. Optimize Chunk Size
Balance context and precision:
- Larger chunks: More context
- Smaller chunks: Higher precision
5. Use Metadata
Add descriptive metadata:
```json
{
  "metadata": {
    "category": "support",
    "version": "2.0",
    "language": "en"
  }
}
```

6. Monitor Quality
Review search results:
- Check relevance
- Update stale content
- Remove duplicates
API Reference
See Sources API for complete endpoint documentation.
