What Is Prompt Caching? — AI Glossary

James Swierczewski

May 17, 2026

What it is: Prompt caching is the practice of storing parts of an AI request so repeat calls reuse the cached portion at a fraction of the cost. Supported natively by Anthropic and OpenAI APIs. Can cut effective API costs by up to 90% on workloads where the same context loads repeatedly.
Who it is for: Developers building AI applications that repeatedly send the same long system prompt, document context, or codebase snapshot.
Best if: You want a short reference on what prompt caching is and when it pays back.
Skip if: You only use AI through chat interfaces that handle caching for you. Want one practical AI workflow every morning? Subscribe to our free daily newsletter.

Table of Contents

What is Prompt Caching?

Prompt caching is the practice of storing the prefix of an AI request — typically a long system prompt, a document, or a codebase context — so subsequent requests with the same prefix can reuse the cached portion at a steeply discounted rate. Anthropic and OpenAI both support it natively in their APIs. The cache is keyed on the exact tokens; any change to the cached portion invalidates the cache.

Why does Prompt Caching matter?

On workloads where the same context loads repeatedly, prompt caching can cut effective costs by up to 90%. This is especially impactful for AI coding agents like Claude Code, where every session loads the same CLAUDE.md, AGENTS.md, and project files. In a long-running session, more than 90% of the tokens billed are cache reads, not fresh input. The pricing math on Anthropic’s Max plans assumes this; without caching, the API would be substantially more expensive per session.

How does Prompt Caching work?

The cached prefix is stored on the vendor’s infrastructure for a short TTL (typically a few minutes; Anthropic’s Claude API extends this with explicit cache hints). Subsequent requests that begin with the same prefix get the cache discount on those tokens. Practical implications:

Order your prompts deliberately. Put the long stable context first (system prompt, AGENTS.md, file dumps) and the variable user input last. The stable part caches; the variable part doesn’t.
Don’t churn the prefix. Small edits to the system prompt invalidate the cache. Pin it for the duration of the session.
Batch requests when possible. Sequential calls within the TTL window all benefit from the cache.

Anthropic also offers a batch processing discount of 50% for non-real-time workloads — pair with caching for maximum savings.

Related terms

Learn more on Beginners in AI

Sources and further reading

Last reviewed: May 2026. AI terminology evolves quickly — verify specifics on the official source pages above.

Get Smarter About AI Every Morning

Free daily newsletter — one term, one tool, one tip. Plain English.

Free forever. Unsubscribe anytime.

Gemini Pricing: Free, Pro & Ultra

Best AI Prompts for Social Media

Do AI Detectors Work? What to Know