Prompt Caching in GPT-4o, Claude Sonnet 4.6, and Gemini 2.5 Pro: How It Works and What You Save

Every model call with a long system prompt re-encodes the same tokens unless the provider has seen that prefix before. GPT-4o handles this automatically: OpenAI caches any prompt prefix of 1,024 tokens or more and applies a 50% discount on those cached input tokens — no code change required, just a cache hit indicator in the API response. Claude Sonnet 4.6 takes an explicit approach via cache_control blocks: mark any message (system prompt, large document, few-shot block) with "type": "ephemeral" and Anthropic stores it for five minutes, charging 25% of normal input cost on subsequent hits and a one-time 125% write charge. Gemini 2.5 Pro provides context caching through a dedicated API call that pins a prefix for a minimum of one hour, making it most cost-effective for long-lived server contexts with 32k+ token prefixes.

The practical impact compounds with scale. A LangGraph customer support agent with a 2,000-token system prompt and RAG few-shot examples making 10,000 calls per day against GPT-4o saves roughly $45/month from caching alone — before any RAG or retrieval optimization. The savings grow linearly with call volume and prefix length. To see how your specific system prompt and prompt structure affect costs at production volume, paste it into the Token Inspector and set daily volume — the tool models cached vs. uncached cost side-by-side across GPT-4o, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 2.5 Pro, and Gemini 2.5 Flash.