Prompt caching lets you reuse processed prompt tokens across multiple requests, reducing both cost and latency. When the same prompt prefix is sent repeatedly, cached tokens are served from cache rather than being re-processed—saving you money and delivering faster responses. Current per-token rates are available on the models page.Documentation Index
Fetch the complete documentation index at: https://docs.gloo.com/llms.txt
Use this file to discover all available pages before exploring further.
Types of Caching
There are three main approaches to prompt caching used by LLM providers:- Implicit caching — The provider automatically caches repeated prompt prefixes. No code changes or configuration required. Requests with matching prefixes are routed to servers that recently processed the same content, enabling cache hits automatically. This is the simplest approach and is used by OpenAI and DeepSeek.
-
Explicit caching — You declare which content to cache using API parameters (such as headers or request body fields). This gives you more control over what gets cached and when. Anthropic uses this model via the
X-Cache-TTLheader. - Storage-based caching — You create managed caches via API that persist for a defined duration, billed per token-hour. This offers the most control but requires cache lifecycle management. Gemini offers this as a separate feature from its implicit caching.
Anthropic — Explicit Caching
Anthropic uses explicit caching via theX-Cache-TTL header. You opt in per request by setting the header, and our API automatically wraps your system messages with cache_control: {type: "ephemeral"} blocks. No modifications to your message content are needed—just add the header.
How it works
- Set the
X-Cache-TTLheader to either"5m"or"1h". - Up to 4 cache blocks per request are created from your system messages.
- Dynamic content at the end of your system messages (such as timestamps or user-specific data) is excluded from cache blocks automatically. Only the stable prefix of your system prompt is cached.
Billing
| Tier | cache_write rate | cache_read rate |
|---|---|---|
X-Cache-TTL: 5m | 125% of input rate | 10% of input rate |
X-Cache-TTL: 1h | 200% of input rate | 10% of input rate |
Supported models
Anthropic caching works with all Claude models available through our API. See Supported Models for the current model list.Example
OpenAI — Implicit + Explicit Caching
OpenAI uses implicit caching as its primary mechanism. Caching works automatically on all API requests—no code changes required. When your prompt contains 1,024 tokens or more, OpenAI routes requests to servers that recently processed the same prompt prefix, enabling cache hits. This can reduce latency by up to 80% and input token costs by up to 90%. For finer control, you can use explicit caching by adding aprompt_cache_key to your request body. This optional parameter influences server routing so that requests sharing the same key and common prefix are more likely to land on the same server, improving cache hit rates.
How it works
- Caching is implicit and automatic for prompts with 1,024+ tokens.
- Requests are routed to servers that recently processed the same prompt prefix.
- Optionally, add
"prompt_cache_key": "your-key"to your request body to improve routing for requests with shared prefixes. - Cache hits reduce both latency and cost. Cached tokens are billed at 10% of the input rate with no write fee.
Billing
| cache_read rate | cache_write rate |
|---|---|
| 10% of input rate | None |
Supported models
OpenAI caching works with OpenAI models available through our API. Theprompt_cache_key parameter is silently ignored for other providers. See Supported Models for the current model list.
Best practice: stable cache keys
Use deterministic, stable keys that match your prompt structure. Changing the system prompt means you should use a new cache key.Example
DeepSeek — Implicit Caching
DeepSeek uses implicit caching that works entirely on the provider side. No user action or configuration is needed.- No headers or request body fields required — caching is automatic.
- Cache read rates vary per model (1–48% of input rate).
- No cache write fee.
- Savings from cache reads appear automatically in your billing.
Gemini
Prompt caching is not currently available for Gemini models through the Gloo AI API. Google offers native context caching as a separate, storage-based feature, but it is not proxied through our platform.
Billing Summary
| Provider | Type | cache_read rate | cache_write rate | Notes |
|---|---|---|---|---|
| Anthropic | Explicit | 10% of input | 5m: 125%; 1h: 200% of input | Two billing tiers via X-Cache-TTL header |
| OpenAI | Implicit + Explicit | 10% of input | None | Automatic for 1,024+ tokens |
| DeepSeek | Implicit | Varies per model (1–48%) | None | Automatic, no user action |
How total cost is calculated
Checking your rates
Use the models endpoint to see current per-model rates:Best Practices
- Static content first: For all providers, place static content (system prompts, instructions) at the beginning of your prompt. This maximizes prefix matches for implicit caching and ensures stable blocks for explicit caching.
- Use explicit caching when available: For Anthropic models, add the
X-Cache-TTLheader to opt in. For OpenAI, useprompt_cache_keyto improve routing for requests with shared prefixes. - Pre-warm on startup: Send an initial request with your standard system prompt to warm the cache before serving users.
- Monitor cache impact: Compare your token usage and billing over time. Cached tokens are billed at a reduced rate, so you’ll see lower input costs when caching is effective. Check your usage in the Gloo Studio billing dashboard.
Migration Guide
If you’re not currently using prompt caching, here’s how to get started:Implicit caching (OpenAI, DeepSeek)
No code changes needed. Caching is automatic for prompts with repeated prefixes. Place static content (system prompts, instructions) at the beginning of your prompt to maximize cache hits.Explicit caching (Anthropic)
Add theX-Cache-TTL: 5m header to opt in to caching. System message caching is then automatic—no content modifications needed.
Improving cache hit rates (OpenAI)
Add"prompt_cache_key": "your-key" to your request body with a stable key that matches your prompt prefix. This improves server routing for requests with shared prefixes.
Related Documentation
- Completions V2 — Core routing mechanisms and streaming details
- Supported Models — Model IDs, capabilities, and current pricing

