# Sources API

Manage data sources and RAG search.

## Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/agents/{id}/sources | List sources |
| POST | /api/agents/{id}/sources | Create source |
| POST | /api/agents/{id}/sources/crawl | Crawl a website |
| GET | /api/agents/{id}/sources/search | Search sources |
| GET | /api/agents/{id}/sources/{sourceId} | Get source |
| DELETE | /api/agents/{id}/sources/{sourceId} | Delete source |
| POST | /api/agents/{id}/sources/{sourceId}/reindex | Reindex source |
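The endpoints above can be wrapped in a small client. This is a sketch only: the paths come from the table, but the base URL, bearer-token auth, and the injected `transport` callable are assumptions made so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SourcesClient:
    """Minimal client sketch for the Sources API endpoints above.

    Auth scheme and base URL are assumptions; transport is injected so
    the sketch can be exercised without a live API.
    """
    base_url: str
    token: str
    transport: Callable[[str, str, dict], Any]

    def _call(self, method: str, path: str) -> Any:
        headers = {"Authorization": f"Bearer {self.token}"}
        return self.transport(method, f"{self.base_url}{path}", headers)

    def list_sources(self, agent_id: str) -> Any:
        return self._call("GET", f"/api/agents/{agent_id}/sources")

    def get_source(self, agent_id: str, source_id: str) -> Any:
        return self._call("GET", f"/api/agents/{agent_id}/sources/{source_id}")

    def reindex(self, agent_id: str, source_id: str) -> Any:
        return self._call("POST", f"/api/agents/{agent_id}/sources/{source_id}/reindex")
```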
## List Sources

```bash
GET /api/agents/{id}/sources
```

**Response**

```json
{
  "sources": [
    {
      "id": "source-123",
      "type": "file",
      "name": "Product Manual",
      "status": "ready",
      "chunks": 150,
      "created_at": "2024-12-01T00:00:00Z"
    },
    {
      "id": "source-456",
      "type": "crawl",
      "name": "Help Center",
      "status": "ready",
      "chunks": 320,
      "config": {
        "url": "https://help.example.com",
        "maxPages": 50,
        "maxDepth": 3,
        "pageCount": 42,
        "lastCrawledAt": "2026-03-10T14:30:00Z"
      },
      "created_at": "2026-03-01T00:00:00Z"
    }
  ]
}
```

## Create Source
### File Upload

```bash
POST /api/agents/{id}/sources
Content-Type: multipart/form-data

-F "type=file"
-F "name=Product Manual"
-F "file=@/path/to/manual.pdf"
```

**Response**

```json
{
  "id": "source-123",
  "type": "file",
  "name": "Product Manual",
  "status": "processing",
  "created_at": "2024-12-15T10:00:00Z"
}
```

## Crawl Website
```bash
POST /api/agents/{id}/sources/crawl
Content-Type: application/json
```

Starts a BFS crawl of the target website using Cloudflare Browser Rendering (the `/markdown` and `/links` endpoints). The crawl runs in the background via `ctx.waitUntil` and does not block the response.
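The bounded BFS traversal can be sketched as follows. `fetch_links` is a stand-in for the Browser Rendering `/links` call; the exact frontier handling and deduplication are assumptions.

```python
from collections import deque

def bfs_crawl(start_url, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl bounded by page and depth limits.

    fetch_links(url) -> list of absolute URLs found on that page; it
    stands in for the Browser Rendering /links endpoint.
    """
    visited = []
    seen = {start_url}
    frontier = deque([(start_url, 0)])  # (url, depth from start)
    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited
```

Because the frontier is a FIFO queue, pages closest to the start URL are always indexed first when `maxPages` cuts the crawl short.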
**Request Body**

```json
{
  "url": "https://docs.example.com",
  "name": "Documentation Site",
  "maxPages": 30,
  "maxDepth": 3,
  "includePatterns": ["/docs/*", "/guides/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"]
}
```

**Parameters**
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | — | Starting URL for the crawl |
| name | string | Yes | — | Display name for the source |
| maxPages | number | No | 50 | Maximum pages to crawl (up to 50) |
| maxDepth | number | No | 3 | Maximum link depth from the start URL (up to 5) |
| includePatterns | string[] | No | [] | Glob patterns; only URLs matching at least one pattern are crawled |
| excludePatterns | string[] | No | [] | Glob patterns; URLs matching any pattern are skipped |
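One plausible reading of the pattern semantics is shell-style glob matching against the URL path, with an empty include list meaning "include everything" and exclude taking precedence. The exact matching rules are an assumption; this sketch only illustrates the idea:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Apply include/exclude globs to a URL's path.

    Assumed semantics: exclude always wins, and an empty include list
    admits every URL not explicitly excluded.
    """
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pat) for pat in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, pat) for pat in include_patterns)
    return True
```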
**Response**

```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "crawling",
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*", "/guides/*"],
    "excludePatterns": ["/blog/*", "/changelog/*"]
  },
  "created_at": "2026-03-11T10:00:00Z"
}
```

The source status is `crawling` while the BFS traversal is in progress. Once the crawl completes, pages are chunked and embedded, and the status transitions to `ready`.
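The chunking step can be sketched as a sliding window with overlap, matching the `chunk_size`/`overlap` fields shown in the Chunking Config. Character-based splitting is an assumption here; the service may split on tokens or semantic boundaries instead.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks with overlapping boundaries.

    The overlap keeps context that straddles a chunk boundary
    retrievable from both neighbouring chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk already reached the end of the text
    return chunks
```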
## Crawl Source Config Fields

When retrieving a crawl source, the `config` object contains these additional metadata fields:
| Field | Type | Description |
|---|---|---|
| url | string | The starting URL |
| maxPages | number | Configured page limit |
| maxDepth | number | Configured depth limit |
| includePatterns | string[] | URL include patterns |
| excludePatterns | string[] | URL exclude patterns |
| pageCount | number | Total pages discovered and indexed |
| lastCrawledAt | string | ISO 8601 timestamp of the most recent crawl |
| pageHashes | object | Map of URL to SHA-256 content hash, used for delta detection |
| changedPages | number | Pages that were new or modified on the last re-crawl |
| unchangedPages | number | Pages skipped because their content hash matched the previous crawl |
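Delta detection against the stored `pageHashes` map can be sketched like this. SHA-256 hashing of page content is per the table above; the helper itself, and hashing the UTF-8 bytes of the rendered markdown, are illustrative assumptions.

```python
import hashlib

def detect_deltas(pages, previous_hashes):
    """Partition crawled pages into changed and unchanged sets.

    pages: dict of URL -> page content (e.g. rendered markdown)
    previous_hashes: dict of URL -> SHA-256 hex digest from the last crawl
    Returns (changed_urls, unchanged_urls, new_hashes).
    """
    changed, unchanged, new_hashes = [], [], {}
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        new_hashes[url] = digest
        if previous_hashes.get(url) == digest:
            unchanged.append(url)  # hash matched: skip re-indexing
        else:
            changed.append(url)    # new page or modified content
    return changed, unchanged, new_hashes
```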
## Reindex Source

```bash
POST /api/agents/{id}/sources/{sourceId}/reindex
```

Reprocesses the source content. For crawl sources, this triggers a full re-crawl with delta detection: the crawler revisits all pages but re-indexes only those whose SHA-256 content hash has changed since the last crawl. Unchanged pages are skipped.
**Response**

```json
{
  "id": "source-789",
  "type": "crawl",
  "status": "crawling",
  "message": "Re-crawl started with delta detection"
}
```

After the re-crawl completes, the source `config` will include updated `changedPages` and `unchangedPages` counts:
```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "ready",
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "pageCount": 42,
    "lastCrawledAt": "2026-03-11T15:00:00Z",
    "changedPages": 3,
    "unchangedPages": 39
  }
}
```

## Search Sources
```bash
GET /api/agents/{id}/sources/search?query=how+to+reset+password&limit=5
```

**Query Parameters**
| Parameter | Type | Description |
|---|---|---|
| query | string | Search query (required) |
| limit | number | Maximum results (default: 10) |
| source_id | string | Filter by source ID |
| min_score | number | Minimum similarity score |
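The `score` field is a vector-similarity value, so `min_score` acts as a similarity threshold. A minimal cosine-similarity sketch of how such filtering might work; this is illustrative only, not the service's actual scoring:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_results(query_vec, chunks, min_score=0.0, limit=10):
    """Score chunks against a query vector, then apply the min_score
    and limit parameters described in the table above.

    chunks: list of (chunk_id, embedding_vector) pairs.
    Returns (score, chunk_id) pairs, best first.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunks]
    scored = [(s, cid) for s, cid in scored if s >= min_score]
    scored.sort(reverse=True)
    return scored[:limit]
```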
**Response**

```json
{
  "results": [
    {
      "id": "chunk-123",
      "source_id": "source-456",
      "content": "To reset your password, go to Settings > Account > Password...",
      "score": 0.92,
      "metadata": {
        "source_name": "User Guide",
        "page": 15,
        "section": "Account Settings"
      }
    }
  ]
}
```

## Get Source
```bash
GET /api/agents/{id}/sources/{sourceId}
```

**Response**

```json
{
  "id": "source-123",
  "type": "file",
  "name": "Product Manual",
  "file_path": "uploads/manual.pdf",
  "status": "ready",
  "chunks": 150,
  "config": {
    "chunk_size": 1000,
    "overlap": 200
  },
  "created_at": "2024-12-01T00:00:00Z",
  "updated_at": "2024-12-15T10:00:00Z"
}
```

**Crawl Source Response**
```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "ready",
  "chunks": 320,
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"],
    "pageCount": 42,
    "lastCrawledAt": "2026-03-11T15:00:00Z",
    "changedPages": 3,
    "unchangedPages": 39
  },
  "created_at": "2026-03-01T00:00:00Z",
  "updated_at": "2026-03-11T15:00:00Z"
}
```

## Delete Source
```bash
DELETE /api/agents/{id}/sources/{sourceId}
```

**Response**

```json
{
  "success": true
}
```

## Refresh Source
```bash
POST /api/agents/{id}/sources/{sourceId}/refresh
```

Reprocesses the source content. For crawl sources, use the Reindex endpoint instead, which performs a re-crawl with delta detection.
## Source Types
| Type | Description |
|---|---|
| file | Uploaded file |
| crawl | Crawled website (BFS traversal with delta detection) |
| database | Database query |
| api | API endpoint |
## Source Status
| Status | Description |
|---|---|
| pending | Waiting to process |
| processing | Currently processing |
| crawling | Website crawl in progress (BFS traversal) |
| ready | Available for search |
| error | Processing failed |
| updating | Refreshing content |
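After creating or reindexing a source, clients typically poll the Get Source endpoint until the status leaves the in-progress states in the table above. A sketch with an injected status fetcher; the polling interval and attempt limit are arbitrary choices, not documented values:

```python
import time

# In-progress statuses from the Source Status table above.
IN_PROGRESS = {"pending", "processing", "crawling", "updating"}

def wait_until_done(get_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll get_status() until the source reaches a terminal status.

    get_status is a callable returning one of the documented statuses;
    it stands in for GET /api/agents/{id}/sources/{sourceId}.
    Returns the terminal status ("ready" or "error").
    """
    for _ in range(max_attempts):
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        sleep(interval)
    raise TimeoutError("source did not finish processing in time")
```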
## Supported Formats
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Text | .txt |
| Markdown | .md |
| Word | .docx |
| JSON | .json |
| CSV | .csv |
## Chunking Config

```json
{
  "config": {
    "chunk_size": 1000,
    "overlap": 200,
    "method": "semantic"
  }
}
```

## Errors
| Code | Description |
|---|---|
| 400 | Invalid source data |
| 404 | Source not found |
| 413 | File too large |
| 415 | Unsupported format |
## Examples

### Upload PDF

```bash
curl -X POST .../sources \
  -F "type=file" \
  -F "name=User Guide" \
  -F "file=@guide.pdf"
```

### Search with Filter
```bash
curl ".../sources/search?query=installation&source_id=source-123&min_score=0.8"
```

### Crawl Website
```bash
curl -X POST .../sources/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://help.example.com",
    "name": "Help Center",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"]
  }'
```

### Check Crawl Source Details
```bash
curl .../sources/source-789
```

The response includes the crawl metadata fields (`pageCount`, `lastCrawledAt`, `changedPages`, `unchangedPages`).
### Reindex Crawl Source (Delta Re-crawl)

```bash
curl -X POST .../sources/source-789/reindex
```

Triggers a re-crawl that re-indexes only pages whose content has changed.
