TL;DR Sometimes a session suddenly turns sluggish and burns through tokens, and at first you have no idea why. It's almost always the cache. Claude Code caches the prefix of every request automatically; a few mid-session moves invalidate it, and your next turn reprocesses the entire conversation at full price. The golden rule: pick your model and effort at the start, and don't touch them mid-task.
Claude Code handles prompt caching for you, so it isn't something you configure. It's something worth not breaking by accident. Knowing what invalidates it is the difference between a session that flows and one that suddenly grinds.
How the cache works
On every turn the model remembers nothing from the last one, so Claude Code re-sends all the context (system prompt, your CLAUDE.md, every prior message) and appends the new part at the end. The API caches by prefix: it matches the start of your request against what it already processed. The match is exact, so a change anywhere in the prefix recomputes everything after it. There's no per-file or per-segment cache.
To make the most of it, Claude Code orders each request from most stable to most volatile:
| Layer | Content | Changes when |
|---|---|---|
| System prompt | Core instructions, tool definitions, output style | The set of loaded tools changes, or Claude Code is upgraded |
| Project context | CLAUDE.md, memory, rules | The session starts, or after /clear or /compact |
| Conversation | Your messages, Claude's responses, tool results | Every turn |
A change in the conversation leaves the two top layers cached. A change in the system prompt invalidates everything. And two things that aren't even text are still part of the cache key: the model and the effort level. Change either one and you start a fresh cache from scratch.
What it looks like
# Healthy cache (the normal state)
cache_read_input_tokens: 48,231 ← reused, ~10% of the input price
cache_creation_input_tokens: 412 ← only what's new this turn
# The turn right after switching models mid-session
cache_read_input_tokens: 0 ← zero hits
cache_creation_input_tokens: 48,643 ← reprocesses EVERYTHING at full price
A cached read costs ~10% of the input price (per the official docs). A cache miss pays full price, which is why that turn is slow and costs roughly 10× as much on input.
The three buckets you need to tell apart
1. What breaks the cache (avoid mid-task)
- Switching models with
/model(and theopusplanplan-mode toggle, which is a model switch in disguise). - Changing effort with
/effort. Claude Code asks you to confirm precisely because it knows it costs you. - Turning on fast mode.
- Denying a whole tool (a bare
BashorWebFetch; scoped rules likeBash(rm *)and all allow/ask rules break nothing). - Upgrading Claude Code and then resuming a long session: the first turn reprocesses the whole history, often the most expensive request you'll send.
2. What resets the cache by design, but is cheap (don't fear it)
/compact: invalidates only the conversation layer. The summary is generated by reading the cache, and the next turn rebuilds a much shorter history, so it's not the slow part. Run it at natural breaks between tasks./clear: you start fresh on purpose./rewind: truncates back to a prefix that was already cached, so you abandon a path without rebuilding the cache, unlike/compact, which builds a new one.
3. What's always safe
- Editing files, invoking skills and commands,
/recap, switching permission modes, spawning subagents. - Editing CLAUDE.md or the output style mid-session doesn't break the cache, but it also doesn't apply: Claude keeps using the version loaded at startup until the next
/clearor restart.
About MCP and plugins: by default, on Opus and Sonnet, MCP tools are deferred, and connecting or disconnecting a server doesn't touch the cache. It only breaks when tools load into the prefix (Haiku, Vertex, a custom gateway, or
alwaysLoad). In the common case, relax.
Check whether your cache is healthy
The two numbers live in the current_usage object the API returns on every response, and the easiest way to watch them is a statusline script:
cache_read_input_tokens: tokens served from cache (at ~10% of input).cache_creation_input_tokens: tokens written to the cache this turn.
A high read-to-creation ratio means caching is working for you. If creation stays high turn after turn, something in your prefix keeps changing, so walk back through the list above.
Stretch the cache if you work in bursts (TTL)
The cache expires after a stretch of inactivity, and every hit resets the timer:
- By default, 5 minutes.
- On a Claude subscription: Claude Code requests the 1-hour TTL automatically, at no extra cost (your usage is included in the plan).
- On an API key, Bedrock, or Vertex: you pay per token, so it stays at 5 minutes unless you set
ENABLE_PROMPT_CACHING_1H=1. - Force 5 minutes for debugging:
FORCE_PROMPT_CACHING_5M=1.
This is the flip side of 10 habits to save tokens: there you shrink the context; here you avoid paying for it twice. And to see the real spend, /usage and /stats lay it out.
Official docs: How Claude Code uses prompt caching
Requirements: keeping the fast-mode header across toggles needs Claude Code v2.1.86+. The ENABLE_PROMPT_CACHING_1H, FORCE_PROMPT_CACHING_5M, and DISABLE_PROMPT_CACHING variables go in the env block of your settings.