
Sources API

Manage data sources and RAG search.

Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/agents/{id}/sources | List sources |
| POST | /api/agents/{id}/sources | Create source |
| POST | /api/agents/{id}/sources/crawl | Crawl a website |
| GET | /api/agents/{id}/sources/search | Search sources |
| GET | /api/agents/{id}/sources/{sourceId} | Get source |
| DELETE | /api/agents/{id}/sources/{sourceId} | Delete source |
| POST | /api/agents/{id}/sources/{sourceId}/reindex | Reindex source |
| POST | /api/agents/{id}/sources/{sourceId}/refresh | Refresh source |

List Sources

```bash
GET /api/agents/{id}/sources
```

Response

```json
{
  "sources": [
    {
      "id": "source-123",
      "type": "file",
      "name": "Product Manual",
      "status": "ready",
      "chunks": 150,
      "created_at": "2024-12-01T00:00:00Z"
    },
    {
      "id": "source-456",
      "type": "crawl",
      "name": "Help Center",
      "status": "ready",
      "chunks": 320,
      "config": {
        "url": "https://help.example.com",
        "maxPages": 50,
        "maxDepth": 3,
        "pageCount": 42,
        "lastCrawledAt": "2026-03-10T14:30:00Z"
      },
      "created_at": "2026-03-01T00:00:00Z"
    }
  ]
}
```

Create Source

File Upload

```bash
POST /api/agents/{id}/sources
Content-Type: multipart/form-data

-F "type=file"
-F "name=Product Manual"
-F "file=@/path/to/manual.pdf"
```

Response

```json
{
  "id": "source-123",
  "type": "file",
  "name": "Product Manual",
  "status": "processing",
  "created_at": "2024-12-15T10:00:00Z"
}
```

Crawl Website

```bash
POST /api/agents/{id}/sources/crawl
Content-Type: application/json
```

Starts a breadth-first (BFS) crawl of the target website using Cloudflare Browser Rendering (the /markdown and /links endpoints). The crawl runs in the background via ctx.waitUntil and does not block the response.
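The traversal described above can be sketched client-side. This is a minimal illustration, not the server implementation; `fetch_links` is a hypothetical stand-in for the Browser Rendering /links call that returns the links found on a page.

```python
from collections import deque

def bfs_crawl(start_url, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl bounded by page and depth limits.

    `fetch_links` takes a URL and returns the links discovered on that
    page (standing in for the Browser Rendering /links endpoint).
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages
```

Because the queue is FIFO, pages closest to the start URL are always indexed first, so a tight `maxPages` budget is spent on the most central pages.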

Request Body

```json
{
  "url": "https://docs.example.com",
  "name": "Documentation Site",
  "maxPages": 30,
  "maxDepth": 3,
  "includePatterns": ["/docs/*", "/guides/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"]
}
```

Parameters

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Starting URL for the crawl |
| name | string | Yes | - | Display name for the source |
| maxPages | number | No | 50 | Maximum pages to crawl (up to 50) |
| maxDepth | number | No | 3 | Maximum link depth from the start URL (up to 5) |
| includePatterns | string[] | No | [] | Glob patterns; only crawl URLs matching at least one pattern |
| excludePatterns | string[] | No | [] | Glob patterns; skip URLs matching any pattern |
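The include/exclude semantics above can be expressed as a small filter. This is an illustrative sketch using Python's `fnmatch`-style globbing applied to the URL path; the server's exact glob dialect may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a URL passes the include/exclude glob filters."""
    path = urlparse(url).path
    # Exclusions always win: skip a URL matching any exclude pattern.
    if any(fnmatchcase(path, pat) for pat in exclude_patterns):
        return False
    # With a non-empty include list, the URL must match at least one pattern.
    if include_patterns:
        return any(fnmatchcase(path, pat) for pat in include_patterns)
    # An empty include list means "crawl everything not excluded".
    return True
```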

Response

```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "crawling",
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*", "/guides/*"],
    "excludePatterns": ["/blog/*", "/changelog/*"]
  },
  "created_at": "2026-03-11T10:00:00Z"
}
```

The source status will be crawling while the BFS traversal is in progress. Once complete, pages are chunked and embedded, and the status transitions to ready.
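Since the crawl runs in the background, a client typically polls until the status leaves crawling. A minimal polling sketch, assuming a hypothetical `get_source` helper that wraps GET /api/agents/{id}/sources/{sourceId} and returns the decoded JSON body:

```python
import time

def wait_until_ready(get_source, source_id, poll_seconds=5.0, timeout=300.0):
    """Poll a source until it reaches a terminal status (ready or error)."""
    deadline = time.monotonic() + timeout
    while True:
        source = get_source(source_id)
        if source["status"] in ("ready", "error"):
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source {source_id} still {source['status']!r}")
        time.sleep(poll_seconds)
```

The loop treats both ready and error as terminal, so a failed crawl surfaces immediately instead of polling until the timeout.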

Crawl Source Config Fields

When retrieving a crawl source, the config object contains these additional metadata fields:

| Field | Type | Description |
| --- | --- | --- |
| url | string | The starting URL |
| maxPages | number | Configured page limit |
| maxDepth | number | Configured depth limit |
| includePatterns | string[] | URL include patterns |
| excludePatterns | string[] | URL exclude patterns |
| pageCount | number | Total pages discovered and indexed |
| lastCrawledAt | string | ISO 8601 timestamp of the most recent crawl |
| pageHashes | object | Map of URL to SHA-256 content hash (used for delta detection) |
| changedPages | number | Pages that were new or modified on the last re-crawl |
| unchangedPages | number | Pages skipped (content hash matched the previous crawl) |

Reindex Source

```bash
POST /api/agents/{id}/sources/{sourceId}/reindex
```

Reprocesses the source content. For crawl sources, this triggers a full re-crawl with delta detection: the crawler revisits all pages but only re-indexes those whose SHA-256 content hash has changed since the last crawl. Unchanged pages are skipped.
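The delta-detection step can be illustrated with the stored pageHashes map. This sketch shows the hash comparison only; fetching, chunking, and embedding are outside its scope.

```python
import hashlib

def detect_changes(pages, previous_hashes):
    """Split crawled pages into changed vs unchanged via SHA-256 hashes.

    `pages` maps URL -> freshly crawled page content; `previous_hashes`
    is the pageHashes map (URL -> hex digest) saved by the last crawl.
    Returns the new hash map plus the changed/unchanged URL lists.
    """
    new_hashes, changed, unchanged = {}, [], []
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        new_hashes[url] = digest
        if previous_hashes.get(url) == digest:
            unchanged.append(url)  # hash matched: skip re-indexing
        else:
            changed.append(url)    # new page, or content was modified
    return new_hashes, changed, unchanged
```

Only the URLs in `changed` need to be re-chunked and re-embedded, which is what keeps re-crawls of mostly static sites cheap.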

Response

```json
{
  "id": "source-789",
  "type": "crawl",
  "status": "crawling",
  "message": "Re-crawl started with delta detection"
}
```

After the re-crawl completes, the source config will include updated changedPages and unchangedPages counts:

```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "ready",
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "pageCount": 42,
    "lastCrawledAt": "2026-03-11T15:00:00Z",
    "changedPages": 3,
    "unchangedPages": 39
  }
}
```

Search Sources

```bash
GET /api/agents/{id}/sources/search?query=how+to+reset+password&limit=5
```

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| query | string | Search query (required) |
| limit | number | Max results (default: 10) |
| source_id | string | Filter by source |
| min_score | number | Minimum similarity score |
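Query values such as multi-word search strings must be URL-encoded. A small helper sketch for assembling the search URL (the function name is illustrative, not part of the API):

```python
from urllib.parse import urlencode

def build_search_url(agent_id, query, limit=10, source_id=None, min_score=None):
    """Assemble the search endpoint path with properly encoded parameters."""
    params = {"query": query, "limit": limit}
    if source_id is not None:
        params["source_id"] = source_id
    if min_score is not None:
        params["min_score"] = min_score
    # urlencode handles spaces and reserved characters in the query string
    return f"/api/agents/{agent_id}/sources/search?{urlencode(params)}"
```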

Response

```json
{
  "results": [
    {
      "id": "chunk-123",
      "source_id": "source-456",
      "content": "To reset your password, go to Settings > Account > Password...",
      "score": 0.92,
      "metadata": {
        "source_name": "User Guide",
        "page": 15,
        "section": "Account Settings"
      }
    }
  ]
}
```

Get Source

```bash
GET /api/agents/{id}/sources/{sourceId}
```

Response

```json
{
  "id": "source-123",
  "type": "file",
  "name": "Product Manual",
  "file_path": "uploads/manual.pdf",
  "status": "ready",
  "chunks": 150,
  "config": {
    "chunk_size": 1000,
    "overlap": 200
  },
  "created_at": "2024-12-01T00:00:00Z",
  "updated_at": "2024-12-15T10:00:00Z"
}
```

Crawl Source Response

```json
{
  "id": "source-789",
  "type": "crawl",
  "name": "Documentation Site",
  "status": "ready",
  "chunks": 320,
  "config": {
    "url": "https://docs.example.com",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"],
    "pageCount": 42,
    "lastCrawledAt": "2026-03-11T15:00:00Z",
    "changedPages": 3,
    "unchangedPages": 39
  },
  "created_at": "2026-03-01T00:00:00Z",
  "updated_at": "2026-03-11T15:00:00Z"
}
```

Delete Source

```bash
DELETE /api/agents/{id}/sources/{sourceId}
```

Response

```json
{
  "success": true
}
```

Refresh Source

```bash
POST /api/agents/{id}/sources/{sourceId}/refresh
```

Reprocesses the source content. For crawl sources, use the Reindex endpoint instead, which performs a re-crawl with delta detection.

Source Types

| Type | Description |
| --- | --- |
| file | Uploaded file |
| crawl | Crawled website (BFS traversal with delta detection) |
| database | Database query |
| api | API endpoint |

Source Status

| Status | Description |
| --- | --- |
| pending | Waiting to process |
| processing | Currently processing |
| crawling | Website crawl in progress (BFS traversal) |
| ready | Available for search |
| error | Processing failed |
| updating | Refreshing content |

Supported Formats

| Format | Extensions |
| --- | --- |
| PDF | .pdf |
| Text | .txt |
| Markdown | .md |
| Word | .docx |
| JSON | .json |
| CSV | .csv |
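Validating the extension client-side avoids a round trip that would end in a 415. A minimal pre-upload check based on the table above:

```python
from pathlib import Path

# Extensions accepted by the ingestion pipeline (see the table above)
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".json", ".csv"}

def is_supported(filename):
    """Return True if the file's extension is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```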

Chunking Config

```json
{
  "config": {
    "chunk_size": 1000,
    "overlap": 200,
    "method": "semantic"
  }
}
```
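To show what chunk_size and overlap mean, here is a fixed-size chunker where each chunk shares `overlap` characters with its predecessor. This is an illustrative sketch; the "semantic" method additionally aligns chunk boundaries with the content rather than cutting at fixed offsets.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of `chunk_size` characters, each overlapping
    the previous chunk by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # fresh characters contributed per chunk
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of indexing roughly `overlap / chunk_size` extra text.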

Errors

| Code | Description |
| --- | --- |
| 400 | Invalid source data |
| 404 | Source not found |
| 413 | File too large |
| 415 | Unsupported format |

Examples

Upload PDF

```bash
curl -X POST .../sources \
  -F "type=file" \
  -F "name=User Guide" \
  -F "file=@guide.pdf"
```

Search with Filter

```bash
curl ".../sources/search?query=installation&source_id=source-123&min_score=0.8"
```

Crawl Website

```bash
curl -X POST .../sources/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://help.example.com",
    "name": "Help Center",
    "maxPages": 30,
    "maxDepth": 3,
    "includePatterns": ["/docs/*"],
    "excludePatterns": ["/blog/*"]
  }'
```

Check Crawl Source Details

```bash
curl .../sources/source-789
```

The response includes the crawl metadata fields (pageCount, lastCrawledAt, changedPages, unchangedPages).

Reindex Crawl Source (Delta Re-crawl)

```bash
curl -X POST .../sources/source-789/reindex
```

Triggers a re-crawl that only re-indexes pages whose content has changed.