Anthropic's prompt caching is one of the most impactful cost-optimization features in the LLM ecosystem. Write a cache breakpoint at the right position in your prompt, and subsequent requests with matching prefixes get a 90% discount on input tokens.
There's just one catch: the cache TTL is 5 minutes. Hard-coded. Non-negotiable.
For a chatbot handling one-off Q&A, five minutes is fine. For an AI agent running a multi-hour coding session, it's catastrophic. And here's the thing — agent workloads are rapidly becoming the dominant use case for LLM APIs, not chatbots.
The Agent Token Multiplier Problem
Before we get to the TTL, let's understand why this matters. A typical agent session isn't a single request-response pair. It's a recursive loop:
User: "Add Redis caching to the auth middleware"
↓
Turn 1: Agent reads auth middleware → calls grep for Redis imports
Turn 2: Agent reads cache config → calls read_file on redis.ts
Turn 3: Agent writes implementation → runs typecheck
...
Turn 47: Agent fixes edge case → runs test suite
Turn 48: Agent handles review feedback → final commit
Each turn sends the entire conversation context — system prompt, tool schemas, and all previous messages — back to the model. By turn 40, that context might be 80,000 tokens, but 76,000 of them (95%) are bit-for-bit identical to what you sent on turn 39.
Without cache reuse, the cost per turn stays high even though 95% of the input hasn't changed. That's the "redundancy tax" — and it's why prompt caching was such a breakthrough.
How Anthropic's Prompt Caching Works (and Why 5 Minutes)
Anthropic's implementation is straightforward. You mark specific positions in your prompt with cache_control: {"type": "ephemeral"}. Anthropic stores the prefix up to that point, and if a subsequent request starts with the same bytes, you get a 90% discount on those cached input tokens.
The cache eviction logic is equally simple: any cache entry older than 5 minutes since its last read is deleted. From Anthropic's documentation:
"Cache entries have a 5-minute TTL. The TTL refreshes with each cache read."
The 5-minute TTL isn't arbitrary — it's a reasonable engineering tradeoff for a multi-tenant service. Cache storage costs real money (memory on inference hardware isn't cheap), and a short TTL ensures stale caches don't accumulate. For 95% of chat-based API usage, 5 minutes is generous.
But agent workloads break this model in three specific ways:
1. Real Sessions Don't Fit in 5-Minute Boxes
A developer using Claude Code doesn't send back-to-back API calls in rapid succession. The rhythm of a coding session looks more like this:
In a 2-hour session with 50 turns, the cache dies and restarts 10-15 times. Each cold start means paying full price on 75K+ tokens that should have been cached. The result: instead of a 90% discount, you might see only 50-60% effective savings.
2. The Coffee Break Penalty
The most common cache-killer isn't a long pause — it's a 6-minute distraction. A developer:
- Runs a test suite (4 minutes of waiting)
- Checks a Slack message (2 minutes)
- Returns to Claude Code
That's 6 minutes. Cache gone. Your next turn costs an extra $2-3 simply because you answered a coworker's question.
3. Multi-Turn Reasoning Breaks
Advanced agent workflows increasingly involve "thinking pauses" — the agent reflects, plans, or waits for external data. Some examples:
| Workflow | Typical Gap Between Turns | |----------|--------------------------| | Code → run tests → review results | 3-8 minutes | | Research → read docs → synthesize | 5-15 minutes | | Deployment → wait for CI → fix | 10-30 minutes | | Multi-agent handoff (Agent A → Agent B) | 1-5 minutes |
Every one of these gaps exceeds or flirts with the 5-minute boundary. The cache becomes a game of roulette.
The Cost Math: Cache Hit Rate vs Session Length
Let's quantify this. We instrumented a sample of 100 real Claude Code sessions (50+ turns each) and measured the effective cache hit rate against session duration:
For a 2-hour session (the median coding session), the effective cache hit rate drops to ~65%. That means 35% of your input tokens are billed at full price instead of the cached rate.
Let's put that in dollars for a typical Claude Code session using Claude Sonnet:
| Scenario | Effective Hit Rate | Cost per Session | |----------|-------------------|------------------| | Perfect caching (theoretical max) | 95% | $8.50 | | Anthropic 5-min TTL (2-hour session) | 65% | $22.40 | | No caching at all | 0% | $68.00 |
The 5-minute TTL costs you $13.90 per session compared to what you'd pay with session-lifetime caching. Across 10 sessions per week, that's $556/month in avoidable costs.
Current Workarounds (and Their Limitations)
The developer community has developed several strategies to cope. None are great.
Workaround 1: Keep-Alive Pings
Send a dummy request every 4 minutes to refresh the cache TTL:
Problems: Wastes tokens on pings. Adds complexity. Doesn't survive network interruptions. And it's a hack — you're fighting the API instead of working with it.
Workaround 2: Manual cache_control Injection
Some agent frameworks inject cache_control markers at strategic positions in the conversation — typically at the last system message, last tool definition, and recent message boundaries.
Problems: Fragile. Every framework implements it differently. Easy to get wrong (wrong positions = no caching, no error). And the 5-minute TTL still applies regardless of marker placement.
Workaround 3: Shorter Sessions
Some teams just accept the limitation and restart sessions more frequently: "Every 30 minutes, start a fresh Claude Code session."
Problems: Loses context. Claude has to re-read files and re-establish understanding. The first 5-10 turns of every new session are slower (cold start on comprehension, not just cache). Productivity hit is real.
Workaround 4: Self-Hosted Cache Layer
Some teams build their own cache proxy that stores prefixes in Redis/Memcached and intercepts API calls:
Client → Custom Proxy (Redis-backed prefix cache) → Anthropic API
Problems: This is a real engineering project. You need to handle chunked streaming, cache invalidation logic, and prefix matching with byte-level precision. The teams doing this successfully are spending weeks of engineering time on infrastructure that isn't their product.
The Real Problem: Cache Lifetime ≠ Session Lifetime
All of these workarounds try to solve the same fundamental mismatch:
A cache that expires at 5 minutes serves a session that lasts 120 minutes. That's a 24:1 mismatch. The cache is optimized for the API provider's infrastructure constraints (minimize memory usage per GPU), not for the application's actual usage pattern.
This isn't a criticism of Anthropic — they're transparent about the limitation, and building a multi-tenant inference service at their scale requires tradeoffs. But it does mean that agent developers are paying a significant "TTL tax" that's invisible in the per-request pricing.
What Session-Lifetime Caching Looks Like
The solution is conceptually simple: tie cache lifetime to the agent session, not to a fixed clock.
Instead of:
Cache TTL = 5 minutes (always, regardless of session state)
Use:
Cache TTL = as long as the session is active
Session ends → cache can be evicted
This transforms the cost curve:
Session-lifetime caching eliminates the sawtooth pattern — every turn benefits from cached prefixes. The cost grows linearly with session length instead of spiking after each cold start.
Synrouter: Session-Aware Caching, No Code Changes
This is exactly what we built Synrouter to do.
Synrouter sits between your agent and the LLM provider as a transparent proxy. It maintains a session state store that maps your agent session to cache entries, with lifetimes that match the actual user session — not an arbitrary clock.
Under the hood, Synrouter:
- Detects session boundaries — recognizes when a new session starts vs when it's a continuation
- Maintains session-scoped caches — cache entries live as long as the session is active (with a configurable session TTL, e.g., 30-minute idle timeout)
- Automatically injects optimal cache_control breakpoints — we handle the marker placement so your framework doesn't have to
- Compresses tool outputs — strips noise (ANSI codes, progress bars, redundant logs) before they bloat your context
The result: a 2-hour coding session that would have a 65% effective cache hit rate with Anthropic's 5-minute TTL achieves 85-95% hit rate with session-lifetime caching.
The Numbers on a Real Session
We took a real 85-turn Claude Code session — a developer building a Stripe billing integration over a 3-hour afternoon — and ran it through three scenarios:
| Scenario | Cache Hit Rate | Total Cost | vs Baseline | |----------|---------------|------------|-------------| | No caching (raw Anthropic API) | 0% | $71.40 | — | | Anthropic 5-min TTL (Claude's built-in) | 52% | $38.20 | −46% | | Synrouter session cache | 88% | $14.80 | −79% |
The developer took two coffee breaks and answered three Slack messages during this session. Each interruption killed the 5-minute cache. Synrouter's session-level cache survived all of them.
What This Means for Agent Teams
If your team runs 100 agent sessions per week (reasonable for a 3-5 person engineering team using Claude Code daily), the math looks like this:
| Approach | Weekly Cost | Monthly Cost | Annual Cost | |----------|------------|--------------|-------------| | Raw Anthropic | $7,140 | $30,940 | $371,280 | | Anthropic 5-min TTL | $3,820 | $16,553 | $198,636 | | Synrouter | $1,480 | $6,413 | $76,960 |
That's $121,676/year saved vs Anthropic's built-in caching — and $294,320/year saved vs no caching at all. These aren't hypothetical numbers; they're extrapolated from real session traces.
The Bottom Line
Anthropic's 5-minute cache TTL isn't a bug — it's a design choice optimized for their infrastructure, not for agent workloads. As AI agents become the dominant consumer of LLM APIs, this mismatch between cache lifetime and session lifetime will only become more expensive.
Session-lifetime caching isn't just a nice-to-have optimization. For teams running agents at scale, it's the difference between a sustainable cost model and a monthly surprise on the API bill.
Synrouter is in Early Access. If you're running agents in production and want session-level caching without building your own proxy infrastructure, click to sign up — we're onboarding users weekly.
Read next: How to Cut Claude Code API Costs by 85%