Prompt caching in Claude Code: why your next turn is slow and expensive (and how to avoid it)

Some sessions suddenly slow to a crawl and burn through tokens, and you can't tell why. The cause is almost always the same, and it's avoidable: the prefix Claude Code caches just got broken.

[Figure: the three prefix layers Claude Code caches and the actions that invalidate them]

TL;DR: When a session suddenly turns sluggish and expensive, it's almost always the cache. Claude Code caches the prefix of every request automatically; a few mid-session moves invalidate it, and your next turn reprocesses the entire conversation at full price. The golden rule: pick your model and effort at the start, and don't touch them mid-task.

Claude Code handles prompt caching for you, so there's nothing to configure, only something to avoid breaking by accident. Knowing what invalidates it is the difference between a session that flows and one that suddenly grinds.

How the cache works

On every turn the model remembers nothing from the last one, so Claude Code re-sends all the context (system prompt, your CLAUDE.md, every prior message) and appends the new part at the end. The API caches by prefix: it matches the start of your request against what it already processed. The match is exact, so a change anywhere in the prefix recomputes everything after it. There's no per-file or per-segment cache.
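A minimal sketch of the prefix rule, for intuition only. It compares whole segments rather than tokens, and the function name and sample segments are made up for illustration; the real API's matching is more fine-grained than this.

```python
from typing import List, Tuple


def split_at_first_change(cached: List[str], request: List[str]) -> Tuple[List[str], List[str]]:
    """Return (reused, recomputed) segments of `request`.

    Everything before the first mismatch is reused. The mismatching
    segment and everything after it has to be processed again.
    """
    matched = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        matched += 1
    return request[:matched], request[matched:]


# Hypothetical segments standing in for a real request.
previous = ["system prompt", "CLAUDE.md", "user: fix the bug", "assistant: done"]

# Normal turn: the old request is an exact prefix, so only the new message is processed.
next_turn = previous + ["user: now add a test"]
reused, recomputed = split_at_first_change(previous, next_turn)
print(len(reused), len(recomputed))  # 4 1

# A change at the very start: everything after it is recomputed, even identical text.
edited = ["system prompt (changed)"] + next_turn[1:]
reused, recomputed = split_at_first_change(previous, edited)
print(len(reused), len(recomputed))  # 0 5
```

The second case is the one that hurts: four of the five segments are byte-for-byte identical, yet none of them can be reused because they sit after the change.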

To make the most of it, Claude Code orders each request from most stable to most volatile:

Layer	Content	Changes when
System prompt	Core instructions, tool definitions, output style	The set of loaded tools changes, or Claude Code is upgraded
Project context	CLAUDE.md, memory, rules	The session starts, or after `/clear` or `/compact`
Conversation	Your messages, Claude's responses, tool results	Every turn

A change in the conversation leaves the two top layers cached. A change in the system prompt invalidates everything. And two things that aren't even text are still part of the cache key: the model and the effort level. Change either one and you start a fresh cache from scratch.
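The layering can be sketched as a small lookup. This is an illustrative model under stated assumptions, not Claude Code's implementation: `RequestShape`, `layers_still_cached`, and the placeholder values ("model-a", "medium", "sys-v1") are all hypothetical, and each layer is treated as one opaque string.

```python
from dataclasses import dataclass, replace
from typing import List

LAYERS = ["system prompt", "project context", "conversation"]  # stable -> volatile


@dataclass(frozen=True)
class RequestShape:
    model: str
    effort: str
    system_prompt: str
    project_context: str
    conversation: str


def layers_still_cached(before: RequestShape, after: RequestShape) -> List[str]:
    """List the layers that survive going from `before` to `after`."""
    # Model and effort are part of the cache key: changing either starts from scratch.
    if (before.model, before.effort) != (after.model, after.effort):
        return []
    pairs = [
        (before.system_prompt, after.system_prompt),
        (before.project_context, after.project_context),
        (before.conversation, after.conversation),
    ]
    kept = []
    for name, (old, new) in zip(LAYERS, pairs):
        if old != new:
            break  # this layer and every layer after it is invalidated
        kept.append(name)
    return kept


base = RequestShape("model-a", "medium", "sys-v1", "ctx-v1", "turns 1-3")

print(layers_still_cached(base, replace(base, conversation="turns 1-4")))
# ['system prompt', 'project context']
print(layers_still_cached(base, replace(base, system_prompt="sys-v2")))  # []
print(layers_still_cached(base, replace(base, model="model-b")))         # []
```

The conversation layer is treated as a single unit here for simplicity; in a normal turn its earlier messages are still reused because the new message is only appended at the end.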

What it looks like

# Healthy cache (the normal state)
cache_read_input_tokens:      48,231   ← reused, ~10% of the input price
cache_creation_input_tokens:     412   ← only what's new this turn

# The turn right after switching models mid-session
cache_read_input_tokens:           0   ← zero hits
cache_creation_input_tokens:  48,643   ← reprocesses EVERYTHING at full price

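A cached read costs ~10% of the input price (per the official docs). A cache miss reprocesses every one of those tokens at the full input rate or slightly above it, since the API bills cache writes at a premium over base input. That is why the turn is slow and costs roughly 10× as much on input.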
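The arithmetic behind that comparison, using the token counts from the sample above. This is a back-of-the-envelope sketch: costs are expressed in "full-price input token" units rather than dollars, output tokens are ignored, and `write_multiplier` defaults to 1.0 to match the "full price" simplification (raise it to model a write premium).

```python
READ_MULTIPLIER = 0.10  # cached read: ~10% of the input price, as stated in the text


def input_cost_units(cache_read: int, cache_creation: int, write_multiplier: float = 1.0) -> float:
    """Input cost in units of one full-price input token."""
    return cache_read * READ_MULTIPLIER + cache_creation * write_multiplier


healthy = input_cost_units(cache_read=48_231, cache_creation=412)
miss = input_cost_units(cache_read=0, cache_creation=48_643)

print(f"healthy turn: {healthy:,.1f} units")
print(f"miss turn:    {miss:,.1f} units")
print(f"ratio:        {miss / healthy:.1f}x")
```

With these numbers the miss turn comes out at a little over 9 times the healthy turn, which is where "roughly 10×" comes from. The longer the cached history, the closer the ratio gets to 10.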

The three buckets you need to tell apart

1. What breaks the cache (avoid mid-task)

Switching models with /model (and the opusplan plan-mode toggle, which is a model switch in disguise).
Changing effort with /effort. Claude Code asks you to confirm precisely because it knows it costs you.
Turning on fast mode.
Denying a whole tool with a bare rule such as Bash or WebFetch. Scoped deny rules like Bash(rm *), and all allow/ask rules, break nothing.
Upgrading Claude Code and then resuming a long session: the first turn reprocesses the whole history, often the most expensive request you'll send.

2. What resets the cache by design, but is cheap (don't fear it)

/compact: invalidates only the conversation layer. The summary is generated by reading the cache, and the next turn rebuilds a much shorter history, so it's not the slow part. Run it at natural breaks between tasks. It costs little, but not all your instructions come back the same (see "what survives /compact").
/clear: you start fresh on purpose.
/rewind: truncates back to a prefix that was already cached, so you abandon a path without rebuilding the cache, unlike /compact, which builds a new one.

3. What's always safe

Editing files, invoking skills and commands, /recap, switching permission modes, spawning subagents.
Delegating with /fork: the fork inherits your prefix, so its first request reuses your cache instead of paying from scratch (cheaper than a normal subagent).
Switching folders with /cd: it moves the session but appends the new folder's CLAUDE.md as a message instead of rewriting the system prompt, so the prefix holds.
Editing CLAUDE.md or the output style mid-session doesn't break the cache, but it also doesn't apply: Claude keeps using the version loaded at startup until the next /clear or restart.

About MCP and plugins: by default, on Opus and Sonnet, MCP tools are deferred with Tool Search, so connecting or disconnecting a server doesn't touch the cache. The cache only breaks when tools load into the prefix (Haiku, Vertex, a custom gateway, or alwaysLoad). In the common case, relax.

Check whether your cache is healthy

The two numbers live in the current_usage object the API returns on every response, and the easiest way to watch them is a statusline script:

cache_read_input_tokens: tokens served from cache (at ~10% of input).
cache_creation_input_tokens: tokens written to the cache this turn.

A high read-to-creation ratio means caching is working for you. If creation stays high turn after turn, something in your prefix keeps changing, so walk back through the list above.
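A minimal sketch of the ratio check a statusline script could run. The two field names come from the text; the nesting of `current_usage` under a `context_window` key, the `render_statusline` helper, and the sample payload are assumptions for illustration, so adjust the lookup to wherever `current_usage` appears in the input your script receives.

```python
import json


def cache_hit_ratio(usage: dict) -> float:
    """Share of cache-related input tokens that were read rather than written."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    total = read + created
    return read / total if total else 0.0


def render_statusline(payload: str) -> str:
    data = json.loads(payload)
    # ASSUMPTION: current_usage sits under "context_window"; this path is illustrative.
    usage = (data.get("context_window") or {}).get("current_usage") or {}
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    return f"cache hit {cache_hit_ratio(usage):.0%} (read {read:,} / write {created:,})"


# Hypothetical payload built from the healthy-cache numbers shown earlier.
SAMPLE = json.dumps({
    "context_window": {
        "current_usage": {
            "cache_read_input_tokens": 48231,
            "cache_creation_input_tokens": 412,
        }
    }
})

print(render_statusline(SAMPLE))
```

In a real statusline script you would pass the script's standard input to `render_statusline` instead of `SAMPLE`. A value that stays near zero for several turns in a row is the signal to go back through the list of cache breakers.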

Stretch the cache if you work in bursts (TTL)

The cache expires after a stretch of inactivity, and every hit resets the timer:

By default, 5 minutes.
On a Claude subscription: Claude Code requests the 1-hour TTL automatically, at no extra cost (your usage is included in the plan).
On an API key, Bedrock, or Vertex: you pay per token, so it stays at 5 minutes unless you set ENABLE_PROMPT_CACHING_1H=1.
Force 5 minutes for debugging: FORCE_PROMPT_CACHING_5M=1.
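To see why the TTL matters for bursty work, here is a toy model of the expiry rule described above: the entry dies after a stretch of inactivity, and every hit resets the timer. The `CacheEntry` class and the request schedule are hypothetical; this only illustrates the timing, not how the cache is actually stored.

```python
from typing import Sequence

FIVE_MIN = 5 * 60
ONE_HOUR = 60 * 60


class CacheEntry:
    def __init__(self, ttl_seconds: float, created_at: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.expires_at = created_at + ttl_seconds

    def request(self, now: float) -> bool:
        """Return True on a hit. Either way the entry lives a full TTL from `now`."""
        hit = now < self.expires_at
        # A hit refreshes the timer; a miss rewrites the entry, which also restarts it.
        self.expires_at = now + self.ttl_seconds
        return hit


def count_misses(ttl_seconds: float, request_times: Sequence[float]) -> int:
    """Count cache misses, treating the first request as the one that creates the entry."""
    entry = CacheEntry(ttl_seconds, created_at=request_times[0])
    return sum(not entry.request(t) for t in request_times[1:])


# Requests at minutes 0, 3 and 6, then a 20-minute break, then minutes 26 and 28.
times = [minute * 60 for minute in (0, 3, 6, 26, 28)]

print(count_misses(FIVE_MIN, times))  # 1: only the request after the break misses
print(count_misses(ONE_HOUR, times))  # 0
```

With the 5-minute TTL the first request after the break pays for the whole history again; with the 1-hour TTL the same schedule never misses.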

This is the flip side of "10 habits to save tokens": there you shrink the context; here you avoid paying for it twice. To see the real spend, use /usage and /stats. To understand what actually cuts you off, see "how your usage limits work".

Official docs: How Claude Code uses prompt caching

Requirements: keeping the fast-mode header across toggles needs Claude Code v2.1.86+. The ENABLE_PROMPT_CACHING_1H, FORCE_PROMPT_CACHING_5M, and DISABLE_PROMPT_CACHING variables go in the env block of your settings.