Prompt caching lets you reuse processed prompt tokens across multiple requests, reducing both cost and latency. When the same prompt prefix is sent repeatedly, the matching tokens are served from cache instead of being re-processed. Current per-token rates are available on the models page.

Types of Caching

There are three main approaches to prompt caching used by LLM providers:
  • Implicit caching — The provider automatically caches repeated prompt prefixes. No code changes or configuration required. Requests with matching prefixes are routed to servers that recently processed the same content, enabling cache hits automatically. This is the simplest approach and is used by OpenAI and DeepSeek.
  • Explicit caching — You declare which content to cache using API parameters (such as headers or request body fields). This gives you more control over what gets cached and when. Anthropic uses this model via the X-Cache-TTL header.
  • Storage-based caching — You create managed caches via API that persist for a defined duration, billed per token-hour. This offers the most control but requires cache lifecycle management. Gemini offers this as a separate feature from its implicit caching.
Some providers combine more than one approach. The sections below explain which types each provider supports and how to use them through the Gloo AI API.

Anthropic — Explicit Caching

Anthropic uses explicit caching via the X-Cache-TTL header. You opt in per request by setting the header, and our API automatically wraps your system messages with cache_control: {type: "ephemeral"} blocks. No modifications to your message content are needed—just add the header.

How it works

  1. Set the X-Cache-TTL header to either "5m" or "1h".
  2. Up to 4 cache blocks per request are created from your system messages.
  3. Dynamic content at the end of your system messages (such as timestamps or user-specific data) is excluded from cache blocks automatically. Only the stable prefix of your system prompt is cached.
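The opt-in step above can be sketched in Python. This is a minimal illustration using only the standard library: the endpoint URL and model name are copied from the curl example in this guide, while `build_cached_request` and the placeholder token are hypothetical names introduced here. The request is built but not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this guide.
API_URL = "https://platform.ai.gloo.com/ai/v2/chat/completions"
# The two header values this guide lists for X-Cache-TTL.
ALLOWED_TTLS = {"5m", "1h"}


def build_cached_request(access_token, model, messages, ttl="5m"):
    """Build (but do not send) a chat request that opts in to caching."""
    if ttl not in ALLOWED_TTLS:
        raise ValueError(f"X-Cache-TTL must be one of {sorted(ALLOWED_TTLS)}, got {ttl!r}")
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "X-Cache-TTL": ttl,  # the only caching-specific part of the request
        },
    )


request = build_cached_request(
    access_token="YOUR_ACCESS_TOKEN",  # placeholder, not a real credential
    model="gloo-anthropic-claude-sonnet-4.5",
    messages=[
        {"role": "system", "content": "You are a theological assistant..."},
        {"role": "user", "content": "What does the Bible say about forgiveness?"},
    ],
    ttl="1h",
)
# To actually send it: urllib.request.urlopen(request)
```

Validating the TTL on the client side is a design choice of this sketch, not documented API behavior; it simply catches typos before a request leaves your application.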

Billing

| Tier | cache_write rate | cache_read rate |
|---|---|---|
| X-Cache-TTL: 5m | 125% of input rate | 10% of input rate |
| X-Cache-TTL: 1h | 200% of input rate | 10% of input rate |
Cache writes are billed at a multiple of the standard input rate. Cache reads are always billed at 10% of the input rate.
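To see when the write surcharge pays for itself, the multipliers above can be plugged into a small calculation. The sketch below is an illustration, not a billing tool: `prefix_cost` and `first_profitable_request_count` are hypothetical helpers, and the model assumes the prefix is written once and every later request is a cache hit within the TTL. Costs are expressed relative to an `input_rate` you supply.

```python
# Multipliers from the billing tiers described in this section.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.00}
READ_MULTIPLIER = 0.10


def prefix_cost(prefix_tokens, requests, input_rate, ttl=None):
    """Cost of sending the same prefix `requests` times.

    ttl=None means no caching. Otherwise the first request pays the
    cache write rate and the remaining requests pay the read rate
    (assumes no cache expiry between requests).
    """
    if ttl is None:
        return prefix_tokens * requests * input_rate
    write = prefix_tokens * WRITE_MULTIPLIER[ttl] * input_rate
    reads = prefix_tokens * (requests - 1) * READ_MULTIPLIER * input_rate
    return write + reads


def first_profitable_request_count(ttl, prefix_tokens=1000, input_rate=1.0):
    """Smallest number of requests at which caching is cheaper than not caching."""
    n = 1
    while prefix_cost(prefix_tokens, n, input_rate, ttl) >= prefix_cost(prefix_tokens, n, input_rate):
        n += 1
    return n


for tier in ("5m", "1h"):
    print(tier, first_profitable_request_count(tier))
```

Under these assumptions the 5m tier is cheaper from the second request onward and the 1h tier from the third, so a single request with caching enabled costs more than one without it.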

Supported models

Anthropic caching works with all Claude models available through our API. See Supported Models for the current model list.

Example

curl -X POST 'https://platform.ai.gloo.com/ai/v2/chat/completions' \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -H 'X-Cache-TTL: 5m' \
  -d '{
    "model": "gloo-anthropic-claude-sonnet-4.5",
    "messages": [
      {"role": "user", "content": "What does the Bible say about forgiveness?"}
    ],
    "stream": false
  }'

OpenAI — Implicit + Explicit Caching

OpenAI uses implicit caching as its primary mechanism. Caching works automatically on all API requests, with no code changes required. When your prompt contains 1,024 tokens or more, OpenAI routes requests to servers that recently processed the same prompt prefix, enabling cache hits. This can reduce latency by up to 80% and input token costs by up to 90%.

For finer control, you can use explicit caching by adding a prompt_cache_key to your request body. This optional parameter influences server routing so that requests sharing the same key and common prefix are more likely to land on the same server, improving cache hit rates.

How it works

  1. Caching is implicit and automatic for prompts with 1,024+ tokens.
  2. Requests are routed to servers that recently processed the same prompt prefix.
  3. Optionally, add "prompt_cache_key": "your-key" to your request body to improve routing for requests with shared prefixes.
  4. Cache hits reduce both latency and cost. Cached tokens are billed at 10% of the input rate with no write fee.

Billing

| cache_read rate | cache_write rate |
|---|---|
| 10% of input rate | None |
There is no write fee for OpenAI caching. You only pay for cache reads at 10% of the input rate.

Supported models

OpenAI caching works with OpenAI models available through our API. The prompt_cache_key parameter is silently ignored for other providers. See Supported Models for the current model list.

Best practice: stable cache keys

Use deterministic, stable keys that match your prompt structure. When the system prompt changes, switch to a new cache key, for example by bumping the version suffix.
Good:  "ministry-block-v1", "sermon-generator", "theology-assistant-v1"
Avoid: "request-12345", "cache-abc"
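One way to keep keys stable yet automatically tied to the prompt is to derive the suffix from the system prompt itself. This is a sketch of an alternative to hand-maintained "-v1" suffixes, not something the API requires: `make_cache_key` is a hypothetical helper, and the only field it feeds is the documented prompt_cache_key.

```python
import hashlib


def make_cache_key(label, system_prompt, length=8):
    """Return '<label>-<short hash>'.

    The key is identical for every request that uses the same system
    prompt and changes as soon as the prompt text changes.
    """
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{label}-{digest[:length]}"


SYSTEM_PROMPT = "You are a theological assistant..."

payload = {
    "model": "gloo-openai-gpt-5.2",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Explain the Trinity"},
    ],
    "prompt_cache_key": make_cache_key("theology-assistant", SYSTEM_PROMPT),
    "stream": False,
}
```

A hashed suffix is less readable than "v1", so a manually versioned key may still be preferable when people need to recognize keys at a glance.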

Example

{
  "model": "gloo-openai-gpt-5.2",
  "messages": [
    {"role": "system", "content": "You are a theological assistant..."},
    {"role": "user", "content": "Explain the Trinity"}
  ],
  "prompt_cache_key": "theology-assistant-v1",
  "stream": false
}

DeepSeek — Implicit Caching

DeepSeek uses implicit caching that works entirely on the provider side. No user action or configuration is needed.
  • No headers or request body fields required — caching is automatic.
  • Cache read rates vary per model (1–48% of input rate).
  • No cache write fee.
  • Savings from cache reads appear automatically in your billing.

Gemini

Prompt caching is not currently available for Gemini models through the Gloo AI API. Google offers native context caching as a separate, storage-based feature, but it is not proxied through our platform.

Billing Summary

| Provider | Type | cache_read rate | cache_write rate | Notes |
|---|---|---|---|---|
| Anthropic | Explicit | 10% of input | 5m: 125%; 1h: 200% of input | Two billing tiers via X-Cache-TTL header |
| OpenAI | Implicit + Explicit | 10% of input | None | Automatic for 1,024+ tokens |
| DeepSeek | Implicit | Varies per model (1–48%) | None | Automatic, no user action |

How total cost is calculated

total_cost = fresh_tokens × input_rate
            + cached_tokens × cache_read_rate
            + cache_write_tokens × cache_write_rate
            + output_tokens × output_rate
A 5.5% Studio markup applies to all segments.
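The formula translates directly into a small function. The sketch below is an illustration under two assumptions: the 5.5% markup is applied multiplicatively on top of the summed segments (equivalent to applying it to each segment), and the rates passed in do not already include it. `total_cost` and the sample rates are hypothetical; use the per-model rates from the models endpoint for real estimates.

```python
STUDIO_MARKUP = 0.055  # 5.5%, as stated in this guide


def total_cost(fresh_tokens, cached_tokens, cache_write_tokens, output_tokens,
               input_rate, cache_read_rate, cache_write_rate, output_rate,
               markup=STUDIO_MARKUP):
    """Sum the four billing segments, then apply the markup to the total."""
    base = (
        fresh_tokens * input_rate
        + cached_tokens * cache_read_rate
        + cache_write_tokens * cache_write_rate
        + output_tokens * output_rate
    )
    return base * (1 + markup)


# Hypothetical per-token rates in arbitrary units, chosen only so that
# cache reads cost 10% and 5m cache writes cost 125% of the input rate.
cost = total_cost(
    fresh_tokens=100, cached_tokens=1000, cache_write_tokens=0, output_tokens=50,
    input_rate=1.0, cache_read_rate=0.10, cache_write_rate=1.25, output_rate=5.0,
)
print(round(cost, 2))
```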

Checking your rates

Use the models endpoint to see current per-model rates:
GET /platform/v2/models

Best Practices

  1. Static content first: For all providers, place static content (system prompts, instructions) at the beginning of your prompt. This maximizes prefix matches for implicit caching and ensures stable blocks for explicit caching.
  2. Use explicit caching when available: For Anthropic models, add the X-Cache-TTL header to opt in. For OpenAI, use prompt_cache_key to improve routing for requests with shared prefixes.
  3. Pre-warm on startup: Send an initial request with your standard system prompt to warm the cache before serving users.
  4. Monitor cache impact: Compare your token usage and billing over time. Cached tokens are billed at a reduced rate, so you’ll see lower input costs when caching is effective. Check your usage in the Gloo Studio billing dashboard.
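The first practice, static content first, can be checked with a quick experiment. The sketch below compares how much text two requests share from the start when the stable instructions lead versus when a per-request timestamp leads. It is only a rough proxy: providers match on tokens rather than characters, and `build_messages`, `flatten`, and the sample values are hypothetical.

```python
import os

STATIC_INSTRUCTIONS = "You are a theological assistant. Answer with references."


def build_messages(user_question, request_context):
    """Stable system prompt first; per-request details last."""
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": f"{user_question}\n\nContext: {request_context}"},
    ]


def flatten(messages):
    """Join messages into one string to approximate the prompt as sent."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


# Good ordering: the whole static block is a shared prefix.
good_a = flatten(build_messages("Explain the Trinity", "2024-01-01T10:00Z"))
good_b = flatten(build_messages("What is grace?", "2024-01-01T10:05Z"))
good_shared = os.path.commonprefix([good_a, good_b])

# Poor ordering: a timestamp in front breaks the prefix almost immediately.
bad_a = "2024-01-01T10:00Z\n" + STATIC_INSTRUCTIONS
bad_b = "2024-01-01T10:05Z\n" + STATIC_INSTRUCTIONS
bad_shared = os.path.commonprefix([bad_a, bad_b])

print(len(good_shared), len(bad_shared))
```

With the static block first, the shared prefix covers all of the instructions; with the timestamp first, it ends inside the timestamp and none of the instructions can be reused.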

Migration Guide

If you’re not currently using prompt caching, here’s how to get started:

Implicit caching (OpenAI, DeepSeek)

No code changes needed. Caching is automatic for prompts with repeated prefixes. Place static content (system prompts, instructions) at the beginning of your prompt to maximize cache hits.

Explicit caching (Anthropic)

Add the X-Cache-TTL: 5m header to opt in to caching. System message caching is then automatic—no content modifications needed.
curl -X POST 'https://platform.ai.gloo.com/ai/v2/chat/completions' \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -H 'X-Cache-TTL: 5m' \
  -d '{ "model": "gloo-anthropic-claude-sonnet-4.5", "messages": [{"role": "user", "content": "..."}] }'

Improving cache hit rates (OpenAI)

Add "prompt_cache_key": "your-key" to your request body with a stable key that matches your prompt prefix. This improves server routing for requests with shared prefixes.