Prompt Caching

Prompt caching lets you reuse processed prompt tokens across multiple requests, reducing both cost and latency. When the same prompt prefix is sent repeatedly, cached tokens are served from cache rather than being re-processed—saving you money and delivering faster responses. Current per-token rates are available on the models page.

Types of Caching

There are three main approaches to prompt caching used by LLM providers:

Implicit caching — The provider automatically caches repeated prompt prefixes. No code changes or configuration required. Requests with matching prefixes are routed to servers that recently processed the same content, enabling cache hits automatically. This is the simplest approach and is used by OpenAI, DeepSeek, Gemini, and many open-source models (Qwen, MiniMax, and others).
Explicit caching — You declare which content to cache using API parameters (such as headers or request body fields). This gives you more control over what gets cached and when. Anthropic models use this approach; through the Gloo AI API you opt in with the X-Cache-TTL header.
Storage-based caching — You create managed caches via API that persist for a defined duration, billed per token-hour. This offers the most control but requires cache lifecycle management. Gemini offers this as a separate feature from its implicit caching.

Some providers combine multiple approaches. The sections below explain which types each provider supports and how to use them through the Gloo AI API.

Anthropic — Explicit Caching

Anthropic models use explicit caching, which the Gloo AI API exposes through the X-Cache-TTL header. You opt in per request by setting the header, and our API automatically wraps your system messages with cache_control: {type: "ephemeral"} blocks. No modifications to your message content are needed; just add the header.

How it works

Set the X-Cache-TTL header to either "5m" or "1h".
Our API places one cache_control marker on the last system message to cache the entire stable system-prefix block. A single marker caches everything before it, covering the full assembled system prompt.
The remaining three breakpoints are free for your own cache_control markers on your message content.
If your request already contains 4 cache_control markers, our marker is silently skipped — your request succeeds with no error, but system-prefix caching is not applied for that call.
Dynamic content at the end of your system messages (such as timestamps or user-specific data) is excluded from the cached block automatically.
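
The breakpoint budget described above can be checked before sending a request. The following is a minimal sketch, not part of the Gloo AI API: the helper names are hypothetical, and the example payload assumes that your own markers use Anthropic-style content blocks with a cache_control field. The counter itself only looks for cache_control keys, so it does not depend on that shape.

```python
from typing import Any

ANTHROPIC_BREAKPOINT_LIMIT = 4  # per-request limit stated in this guide


def count_cache_control_markers(node: Any) -> int:
    """Recursively count "cache_control" keys anywhere in a request payload."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        return own + sum(count_cache_control_markers(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_cache_control_markers(item) for item in node)
    return 0


def platform_marker_applies(payload: dict) -> bool:
    """True if a breakpoint is still free for the auto-placed system marker."""
    return count_cache_control_markers(payload) < ANTHROPIC_BREAKPOINT_LIMIT


# Assumed payload shape: one user-owned marker on a long content block.
payload = {
    "model": "gloo-anthropic-claude-sonnet-4.5",
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Long reference document...",
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": "Summarize the document."},
        ]},
    ],
}
print(count_cache_control_markers(payload))  # 1
print(platform_marker_applies(payload))      # True
```

A check like this is mainly useful as a guard in client code, since a request that already carries 4 markers still succeeds but loses system-prefix caching without any error.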

Endpoint behavior

Endpoint	Breakpoint ownership	What happens
Routed (`/ai/v2/chat/completions`)	Gloo owns 1 (system prefix); you own up to 3	Our marker is auto-placed on the last system message. You may add up to 3 of your own on your content.

The Anthropic breakpoint limit is 4 per request. Minimum cacheable token thresholds vary by model (e.g., 4,096 tokens for Claude Haiku 4.5). See Anthropic’s prompt caching docs for current thresholds.

Billing

Tier	cache_write rate	cache_read rate
`X-Cache-TTL: 5m`	125% of input rate	10% of input rate
`X-Cache-TTL: 1h`	200% of input rate	10% of input rate

Cache writes are billed at a multiple of the standard input rate. Cache reads are always billed at 10% of the input rate.
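
To see when the write premium pays for itself, the sketch below compares the cost of a shared prefix with and without caching, using only the multipliers from the table above. It assumes the first call writes the cache and every later call is a hit within the TTL; the function name is illustrative, and costs are expressed relative to one uncached pass over the prefix rather than in real prices.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.00}  # cache_write as a multiple of the input rate
READ_MULTIPLIER = 0.10                        # cache_read as a multiple of the input rate


def relative_prefix_cost(requests: int, ttl: str) -> tuple[float, float]:
    """Return (cached, uncached) cost of a shared prefix over `requests` calls.

    Units are "one uncached pass over the prefix". Assumes one cache write
    followed by cache hits on every later call.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    cached = WRITE_MULTIPLIER[ttl] + (requests - 1) * READ_MULTIPLIER
    uncached = float(requests)
    return cached, uncached


for ttl in ("5m", "1h"):
    for n in (1, 2, 3):
        cached, uncached = relative_prefix_cost(n, ttl)
        print(f"ttl={ttl} requests={n} cached={cached:.2f} uncached={uncached:.2f}")
```

Under these assumptions the 5m tier is cheaper from the second request onward (1.35 vs 2.00), while the 1h tier needs a third request to come out ahead (2.20 vs 3.00).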

Supported models

Anthropic caching works with all Claude models available through our API. See Supported Models for the current model list.

Example

curl -X POST 'https://platform.ai.gloo.com/ai/v2/chat/completions' \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -H 'X-Cache-TTL: 5m' \
  -d '{
    "model": "gloo-anthropic-claude-sonnet-4.5",
    "messages": [
      {"role": "user", "content": "What does the Bible say about forgiveness?"}
    ],
    "stream": false
  }'

OpenAI — Implicit + Explicit Caching

OpenAI uses implicit caching as its primary mechanism. Caching works automatically on all API requests, with no code changes required. When your prompt contains 1,024 tokens or more, OpenAI routes requests to servers that recently processed the same prompt prefix, enabling cache hits. This can reduce latency by up to 80% and input token costs by up to 90%. See OpenAI’s prompt caching announcement for details.

For finer control, you can use explicit caching by adding a prompt_cache_key to your request body. This optional parameter influences server routing so that requests sharing the same key and common prefix are more likely to land on the same server, improving cache hit rates.

How it works

Caching is implicit and automatic for prompts with 1,024+ tokens.
Requests are routed to servers that recently processed the same prompt prefix.
Optionally, add "prompt_cache_key": "your-key" to your request body to improve routing for requests with shared prefixes.
Cache hits reduce both latency and cost. Cached tokens are billed at a discounted rate that varies by model — 10% of the input rate for GPT-5 models and 25% for GPT-4.1 and o-series models — with no write fee.

Billing

The cache read discount depends on the model. Current OpenAI families:

Model family	cache_read rate	cache_write rate
GPT-5 family	10% of input rate	None
GPT-4.1 family, o3, o4-mini	25% of input rate	None

There is no write fee for OpenAI caching — you only pay for cache reads, at the model-specific rate above. Note that cheaper GPT-5 variants (gpt-5-mini, gpt-5-nano) have higher effective cache-read ratios (~23–56%) because of a minimum read-rate floor. See Checking your rates for exact per-model values.
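
One plausible reading of the read-rate floor is that the cache-read rate is the larger of the percentage discount and a fixed minimum. The sketch below illustrates that reading only; the floor mechanism, the function name, and all prices are assumptions for illustration, not published rates.

```python
def effective_cache_read_ratio(input_rate: float, discount: float, floor_rate: float) -> float:
    """Cache-read rate as a fraction of the input rate when a minimum rate applies."""
    cache_read_rate = max(input_rate * discount, floor_rate)
    return cache_read_rate / input_rate


# Hypothetical prices per million tokens, chosen only to show the effect.
print(round(effective_cache_read_ratio(2.00, 0.10, 0.05), 2))  # 0.1
print(round(effective_cache_read_ratio(0.20, 0.10, 0.05), 2))  # 0.25
```

With the same floor, the cheaper model ends up paying a larger share of its input rate for cached tokens, which is the pattern described above for gpt-5-mini and gpt-5-nano.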

Supported models

OpenAI caching works with OpenAI models available through our API. The prompt_cache_key parameter is silently ignored for other providers. See Supported Models for the current model list.

Best practice: stable cache keys

Use deterministic, stable keys that match your prompt structure. Changing the system prompt means you should use a new cache key.

Good:  "ministry-block-v1", "sermon-generator", "theology-assistant-v1"
Avoid: "request-12345", "cache-abc"
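
One way to keep keys both stable and tied to the prompt is to derive them from the prompt text. The following is a sketch of that idea, not a Gloo or OpenAI requirement: the helper name and the name-plus-hash key format are assumptions, and a hand-maintained version suffix such as "-v1" works equally well.

```python
import hashlib


def stable_cache_key(name: str, system_prompt: str) -> str:
    """Build a key that stays fixed until the system prompt text changes."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:8]
    return f"{name}-{digest}"


prompt_v1 = "You are a theological assistant..."
key_a = stable_cache_key("theology-assistant", prompt_v1)
key_b = stable_cache_key("theology-assistant", prompt_v1)
key_c = stable_cache_key("theology-assistant", prompt_v1 + " Cite sources.")
print(key_a == key_b)  # True: same prompt, same key
print(key_a == key_c)  # False: edited prompt, new key
```

Because the key is computed from the prompt, it never varies per request, and editing the system prompt produces a new key automatically.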

Example

{
  "model": "gloo-openai-gpt-5.2",
  "messages": [
    { "role": "system", "content": "You are a theological assistant..." },
    { "role": "user", "content": "Explain the Trinity" }
  ],
  "prompt_cache_key": "theology-assistant-v1",
  "stream": false
}

DeepSeek — Implicit Caching

DeepSeek uses implicit caching that works entirely on the provider side. No user action or configuration is needed.

No headers or request body fields required — caching is automatic.
Cache read rates vary per model (roughly 9–90% of the input rate, depending on the model).
No cache write fee.
Savings from cache reads appear automatically in your billing.

Gemini — Implicit Caching

Gemini uses implicit caching on the provider side. No headers or request body fields are required—repeated prompt prefixes are automatically cached when you send a request that matches a previously seen prefix.

How it works

No customer action needed — caching is automatic. No X-Cache-TTL header, no cache_control markers, and no prompt_cache_key are required.
Repeated prefixes are cached by the provider automatically.
Cache read rates are roughly 10% of the input rate for larger Gemini models; cheaper flash and flash-lite models are higher (up to ~33%) — check the models endpoint for exact per-model rates. Cache hits surface as a separate cache_read segment in your billing, the same as the other providers. There is no cache write fee for implicit caching.

Supported models

Gemini caching is available for all Gemini models in the Gloo AI catalog. Minimum token thresholds vary by model and are determined by the provider.

Model	Minimum token threshold
Gemini 2.5 Flash, 2.5 Flash Lite	2,048 tokens
Gemini 3 Flash, 3.1 Flash Lite, 3.5 Flash	4,096 tokens

Other Models (Qwen, MiniMax, and more) — Implicit Caching

A number of additional models in the Gloo AI catalog — including Qwen, MiniMax, and Xiaomi MiMo — support implicit caching through their upstream provider. Caching is automatic; no headers, cache_control markers, or prompt_cache_key are required.

Automatic — repeated prompt prefixes are cached provider-side and surface as a cache_read segment in your billing, the same as the other providers.
Cache read rates vary widely per model, roughly 25–65% of the input rate, so for some models the discount is much smaller than with the major providers. A few models (e.g. some Qwen variants) also carry a cache write fee of 1.25× the input rate; most do not.
Because rates differ substantially between models, use the models endpoint for exact per-model cache_read / cache_write values.

Not every model in this group caches — support is set by the upstream provider, and the models endpoint is the source of truth for which models have cache pricing.

Billing Summary

Provider	Type	cache_read rate	cache_write rate	Notes
Anthropic	Explicit	10% of input	5m: 125%; 1h: 200% of input	Two billing tiers via `X-Cache-TTL` header
OpenAI	Implicit + Explicit	Varies per model (10% GPT-5; 25% GPT-4.1 & o-series)	None	Automatic for 1,024+ tokens
DeepSeek	Implicit	Varies per model (~9–90%)	None	Automatic, no user action
Gemini	Implicit	Varies per model (~10%; flash/flash-lite up to ~33%)	None	Automatic, no user action required
Other (Qwen, MiniMax, Xiaomi, …)	Implicit	Varies per model (~25–65%)	Most none; some 1.25×	Automatic, set by upstream provider

How total cost is calculated

total_cost = fresh_tokens × input_rate
            + cached_tokens × cache_read_rate
            + cache_write_tokens × cache_write_rate
            + output_tokens × output_rate

The formula above applies to every provider: cached tokens are billed as a separate cache_read segment. OpenAI, DeepSeek, and Gemini have no cache write fee, so their cache_write term is always zero.
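
As a worked example, the formula can be written as a small function. This is a sketch under stated assumptions: the rates below are hypothetical per-token list prices, not values from the models endpoint, and the flat platform rate is left out.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Per-token list prices; the values used below are hypothetical."""
    input_rate: float
    cache_read_rate: float
    cache_write_rate: float
    output_rate: float


def total_cost(fresh_tokens: int, cached_tokens: int, cache_write_tokens: int,
               output_tokens: int, rates: Rates) -> float:
    """Apply the four-term formula; the flat platform rate is not included."""
    return (fresh_tokens * rates.input_rate
            + cached_tokens * rates.cache_read_rate
            + cache_write_tokens * rates.cache_write_rate
            + output_tokens * rates.output_rate)


# Hypothetical: input 3.00 per million tokens, read at 10%, write at 125%, output 15.00.
rates = Rates(input_rate=3.00e-6, cache_read_rate=0.30e-6,
              cache_write_rate=3.75e-6, output_rate=15.00e-6)

cost = total_cost(fresh_tokens=500, cached_tokens=8000, cache_write_tokens=0,
                  output_tokens=400, rates=rates)
print(f"{cost:.6f}")  # 0.009900
```

In this example the 8,000 cached tokens cost 0.0024 instead of the 0.0240 they would cost at the full input rate.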

A flat platform rate applies to all segments. The per-segment rates returned by the models endpoint are list prices (the model’s base rate); the platform rate is added at billing.

Checking your rates

Use the models endpoint to see current per-model rates:

GET /platform/v2/models
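
A request for this endpoint could be prepared as follows. This is only a sketch: it assumes the models endpoint lives on the same host as the completions examples and accepts the same Bearer token, neither of which is confirmed here, and it does not parse the response because the response schema is not described in this guide.

```python
import urllib.request

# Assumption: the models endpoint shares the host used by the completions examples.
BASE_URL = "https://platform.ai.gloo.com"


def build_models_request(access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the models endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/platform/v2/models",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_models_request("YOUR_ACCESS_TOKEN")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then JSON-decode the response body.
```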

Best Practices

Static content first: For all providers, place static content (system prompts, instructions) at the beginning of your prompt. This maximizes prefix matches for implicit caching and ensures stable blocks for explicit caching.
Use explicit caching when available: For Anthropic models, add the X-Cache-TTL header to opt in. For OpenAI, use prompt_cache_key to improve routing for requests with shared prefixes.
Pre-warm on startup: Send an initial request with your standard system prompt to warm the cache before serving users.
Monitor cache impact: Compare your token usage and billing over time. Cached tokens are billed at a reduced rate, so you’ll see lower input costs when caching is effective. Check your usage in the Gloo Studio billing dashboard.
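
The "static content first" practice can be illustrated with a small message builder. This is a sketch with hypothetical helper names; it shows the ordering idea only and makes no claim about how any provider measures prefixes internally.

```python
STATIC_SYSTEM_PROMPT = "You are a theological assistant..."  # identical on every call


def build_messages(user_question: str, dynamic_context: str) -> list[dict]:
    """Order messages so the unchanging part forms the prompt prefix."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable prefix
        {"role": "system", "content": dynamic_context},       # changes per request
        {"role": "user", "content": user_question},
    ]


def shared_prefix_length(a: list[dict], b: list[dict]) -> int:
    """Count how many leading messages two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


first = build_messages("Explain the Trinity", "Current date: 2025-01-01")
second = build_messages("What is grace?", "Current date: 2025-01-02")
print(shared_prefix_length(first, second))  # 1
```

If the dynamic context were placed before the static prompt, the two requests would share no leading messages and the prefix could not be reused.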

Migration Guide

If you’re not currently using prompt caching, here’s how to get started:

Implicit caching (OpenAI, DeepSeek, Gemini)

No code changes needed. Caching is automatic for prompts with repeated prefixes. Place static content (system prompts, instructions) at the beginning of your prompt to maximize cache hits. On Gemini, prefix-only caching works on the provider side with no special parameters.

Explicit caching (Anthropic)

Add the X-Cache-TTL: 5m header to opt in to caching. System message caching is then automatic—no content modifications needed.

curl -X POST 'https://platform.ai.gloo.com/ai/v2/chat/completions' \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -H 'X-Cache-TTL: 5m' \
  -d '{ "model": "gloo-anthropic-claude-sonnet-4.5", "messages": [{"role": "user", "content": "..."}] }'

Improving cache hit rates (OpenAI)

Add "prompt_cache_key": "your-key" to your request body with a stable key that matches your prompt prefix. This improves server routing for requests with shared prefixes.

Related Documentation

Completions V2 — Core routing mechanisms and streaming details
Supported Models — Model IDs, capabilities, and current pricing
