LLM Integration

Configure AI providers for your agent.

Supported Providers

Provider     Models                               Via
Workers AI   Llama 4 Scout, GPT-OSS 120B          Direct
OpenAI       GPT-4o, o1                           AI Gateway
Anthropic    Claude Opus 4.5, Claude Sonnet 4.5   AI Gateway
Google       Gemini 2.5 Pro, Gemini 2.5 Flash     AI Gateway

Configuration

Via Settings

  1. Navigate to Settings
  2. Configure the LLM section:
    • Provider
    • Model
    • Temperature
    • Max tokens
    • System prompt

Settings Schema

json
{
  "llm_provider": "workers-ai",
  "llm_model": "@cf/meta/llama-3.1-8b-instruct",
  "llm_temperature": 0.7,
  "llm_max_tokens": 2048,
  "llm_system_prompt": "You are a helpful assistant."
}
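The schema above can be modeled as a typed settings object. A minimal sketch, assuming a hypothetical `LlmSettings` interface and `resolveSettings` helper (not part of the product API), that merges partial user settings over the defaults and clamps temperature to the documented 0-1 range:

```typescript
// Hypothetical shape of the settings schema shown above.
interface LlmSettings {
  llm_provider: "workers-ai" | "openai" | "anthropic" | "google";
  llm_model: string;
  llm_temperature: number;
  llm_max_tokens: number;
  llm_system_prompt: string;
}

const DEFAULTS: LlmSettings = {
  llm_provider: "workers-ai",
  llm_model: "@cf/meta/llama-3.1-8b-instruct",
  llm_temperature: 0.7,
  llm_max_tokens: 2048,
  llm_system_prompt: "You are a helpful assistant.",
};

// Merge partial user settings over the defaults, clamping temperature to 0-1.
function resolveSettings(partial: Partial<LlmSettings>): LlmSettings {
  const merged = { ...DEFAULTS, ...partial };
  merged.llm_temperature = Math.min(1, Math.max(0, merged.llm_temperature));
  return merged;
}
```

Validating on write keeps out-of-range values (for example, a temperature of 2) from reaching the provider API.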

Workers AI Models

Available Models

Model           Context   Best For
llama-4-scout   131K      General chat, tool calling
gpt-oss-120b    128K      Complex reasoning, instruction following

No API Key Required

Workers AI is built-in and requires no API key. Models are accessed directly through Cloudflare's infrastructure.

GPT-OSS Format

GPT-OSS models use an instruction-based format instead of chat messages:

json
{
  "llm_provider": "workers-ai",
  "llm_model": "@cf/openai/gpt-oss-120b"
}

The model accepts an instructions field (the system prompt) and an input field (the user query); this format is optimized for single-turn tasks.
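Because the two Workers AI model families take different request shapes, a dispatcher can build the right payload from the same settings. A sketch, assuming a hypothetical `buildPayload` helper (the field names follow the chat-messages and instructions/input formats described above):

```typescript
type Payload =
  | { messages: { role: "system" | "user"; content: string }[] } // chat format
  | { instructions: string; input: string };                     // GPT-OSS format

// Build the request body for a Workers AI model from the system prompt
// and user query, choosing the format by model ID.
function buildPayload(model: string, system: string, query: string): Payload {
  if (model.startsWith("@cf/openai/gpt-oss")) {
    // GPT-OSS: instruction-based, single-turn.
    return { instructions: system, input: query };
  }
  // Default: chat-message format.
  return {
    messages: [
      { role: "system", content: system },
      { role: "user", content: query },
    ],
  };
}
```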

OpenAI Integration

Configuration

json
{
  "llm_provider": "openai",
  "llm_model": "gpt-4o"
}

API Key

Store in agent settings or secrets:

bash
npx wrangler secret put OPENAI_API_KEY

Anthropic Integration

Configuration

json
{
  "llm_provider": "anthropic",
  "llm_model": "claude-sonnet-4-5-20250929"
}

API Key

Store in agent settings or secrets:

bash
npx wrangler secret put ANTHROPIC_API_KEY

Google Integration

Configuration

json
{
  "llm_provider": "google",
  "llm_model": "gemini-2.5-flash"
}

API Key

Store in agent settings or secrets:

bash
npx wrangler secret put GOOGLE_AI_API_KEY
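The three hosted providers authenticate differently when called directly over HTTPS. A sketch, assuming a hypothetical `authHeaders` helper; the header names are each provider's standard ones (Bearer token for OpenAI, x-api-key plus a dated anthropic-version for Anthropic, x-goog-api-key for Google):

```typescript
// Build auth headers for a direct HTTPS call to each provider's API.
function authHeaders(
  provider: "openai" | "anthropic" | "google",
  apiKey: string
): Record<string, string> {
  switch (provider) {
    case "openai":
      return { Authorization: `Bearer ${apiKey}` };
    case "anthropic":
      // Anthropic requires an API version header alongside the key.
      return { "x-api-key": apiKey, "anthropic-version": "2023-06-01" };
    case "google":
      return { "x-goog-api-key": apiKey };
  }
}
```

When requests are routed through AI Gateway, as in the provider table above, the gateway forwards these same provider credentials.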

Parameters

Temperature

Controls randomness (0-1):

  • 0 - Deterministic, focused
  • 0.7 - Balanced (default)
  • 1 - Creative, varied

Max Tokens

Maximum response length:

  • Varies by model
  • Higher = longer responses
  • Consider context limits

System Prompt

Instructions for the AI:

You are a customer support agent for Acme Corp.
- Answer product questions
- Help with issues
- Be polite and professional

Best Practices

1. Choose the Right Model

Use Case            Recommended
General chat        Llama 4 Scout (Workers AI)
Complex reasoning   Claude Opus 4.5 / GPT-4o / GPT-OSS 120B
Fast responses      Gemini 2.5 Flash
Code generation     GPT-4o
Cost-effective      Workers AI models (Llama 4, GPT-OSS)

2. Write Clear Prompts

Be specific in system prompts:

  • Define the role
  • Set boundaries
  • Specify format

3. Monitor Usage

Track token consumption and costs (see Token Usage Tracking below).

4. Test Before Production

Verify responses with test conversations.

Token Usage Tracking

Monitor AI usage with detailed token counts per conversation, model, and time period.

Dashboard

Navigate to Token Usage in the settings to view:

  • Total tokens - Cumulative usage
  • By model - Breakdown per model
  • By period - Daily, weekly, monthly trends
  • By conversation - Per-thread usage

Usage Data

json
{
  "period": "2024-12",
  "totalTokens": 1250000,
  "promptTokens": 800000,
  "completionTokens": 450000,
  "models": {
    "@cf/meta/llama-4-scout": 900000,
    "gpt-4o": 250000,
    "claude-sonnet-4-5": 100000
  },
  "cost": {
    "estimated": 12.50,
    "currency": "USD"
  }
}

Per-Message Tracking

Each chat message records token usage:

json
{
  "id": "msg_abc123",
  "role": "assistant",
  "content": "...",
  "tokenUsage": {
    "promptTokens": 1250,
    "completionTokens": 340,
    "totalTokens": 1590
  }
}
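Given per-message tokenUsage records like the one above, conversation totals can be aggregated with a small reducer. A sketch, assuming a hypothetical `sumUsage` helper that mirrors the fields shown:

```typescript
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

interface ChatMessage {
  id: string;
  role: string;
  tokenUsage?: TokenUsage; // user messages may carry no usage record
}

// Sum token usage across a conversation, skipping messages without a record.
function sumUsage(messages: ChatMessage[]): TokenUsage {
  return messages.reduce(
    (acc, m) =>
      m.tokenUsage
        ? {
            promptTokens: acc.promptTokens + m.tokenUsage.promptTokens,
            completionTokens: acc.completionTokens + m.tokenUsage.completionTokens,
            totalTokens: acc.totalTokens + m.tokenUsage.totalTokens,
          }
        : acc,
    { promptTokens: 0, completionTokens: 0, totalTokens: 0 }
  );
}
```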

Optimizing Token Usage

  1. Use Selective Tool Loading - Reduces system prompt by ~77%
  2. Enable Code Mode (Selective) - Compact tool hints instead of full schemas
  3. Trim conversation history - Automatic context management
  4. Choose efficient models - Smaller models for simple tasks
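Item 3 above (trimming conversation history) can be sketched as a budget-based cut: keep the most recent messages whose estimated token count fits a context budget. Both `estimateTokens` and `trimHistory` are hypothetical helpers; the product's automatic context management may use different heuristics:

```typescript
// Rough token estimate: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit within `budget` tokens, preserving order.
function trimHistory(history: string[], budget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i]);
    if (used + cost > budget) break; // oldest messages are dropped first
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

Walking from newest to oldest guarantees the most recent turns survive, which is usually what matters for coherent replies.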