The price sheet for a frontier language model lists a flat rate per million tokens. It does not list the compounding cost effect that emerges when those tokens are assembled into long-context requests. This distinction — between the unit price of a token and the system cost of a context window — is where the most dangerous budget surprises of 2026 are occurring. A healthcare organisation recently discovered this gap at the scale of a $6 million overrun in a production retrieval-augmented generation (RAG) pipeline. Understanding what happened requires reasoning about the physics of context before you can reason about the economics.
Attention Is Quadratic
Modern transformer architectures implement a mechanism called self-attention, in which every token in a sequence attends to every other token. The computational complexity of this operation is \(O(n^2)\) with respect to sequence length. Double the context, and attention cost quadruples. This is not a bug or an implementation inefficiency — it is a fundamental property of the architecture that makes long-range contextual reasoning possible.
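To make the scaling concrete: the attention score matrix has \(n^2\) entries, so doubling the sequence length quadruples the work regardless of constant factors.

\[\frac{(2n)^2}{n^2} = 4, \qquad n = 8{,}192 \Rightarrow n^2 \approx 6.7 \times 10^7, \qquad n = 16{,}384 \Rightarrow n^2 \approx 2.7 \times 10^8\]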
For the economics of API inference, this has a precise implication. The cost formula for a single API call is:
\[C_{call} = T_{in} \cdot P_{in} + T_{out} \cdot P_{out}\]

But for a system that constructs context windows dynamically, the effective system cost across \(N\) calls with growing context is:
\[C_{system} = \sum_{i=1}^{N} \left( T_{in}^{(i)} \cdot P_{in} + T_{out}^{(i)} \cdot P_{out} \right)\]

where \(T_{in}^{(i)}\) typically grows with each conversation turn as retrieved documents and prior exchanges are prepended to the prompt. In a naive RAG implementation with no context pruning, a 20-turn conversation that starts at 2,000 tokens can reach 40,000 input tokens by turn 20, a 20x input-cost multiplier relative to the first turn.
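The compounding is easy to simulate. The sketch below assumes a naive append-everything conversation, Sonnet-class pricing, and the illustrative per-turn figures above; the function is ours, not from any production system.

```python
# Sketch: cumulative cost of a naive multi-turn RAG conversation.
# All figures are illustrative assumptions, not measurements.

PRICE_IN = 3.00 / 1_000_000    # $ per input token (Sonnet-class)
PRICE_OUT = 15.00 / 1_000_000  # $ per output token

def naive_conversation_cost(turns: int, base_context: int = 2_000,
                            growth_per_turn: int = 2_000,
                            output_per_turn: int = 500) -> float:
    """Total cost when every turn re-sends all prior context."""
    total = 0.0
    for turn in range(turns):
        input_tokens = base_context + turn * growth_per_turn
        total += input_tokens * PRICE_IN + output_per_turn * PRICE_OUT
    return total

# Turn 1 sends 2,000 input tokens; turn 20 sends 40,000.
print(f"20-turn conversation: ${naive_conversation_cost(20):.2f}")
```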
The Healthcare RAG Case
A large healthcare network deployed a clinical decision-support RAG system using a GPT-4-class model. The system was designed to surface relevant patient history, drug interaction data, and clinical guidelines in response to clinician queries. The architecture was textbook: a vector store of indexed documents, a retrieval layer that fetched the top-k relevant chunks, and an LLM that synthesised retrieved context into a clinical recommendation.
The initial cost model was straightforward. Estimated query volume: 50,000 queries per day. Average retrieved context: 8,000 tokens. Average output: 500 tokens. At Sonnet-class pricing ($3.00/1M input, $15.00/1M output):
\[C_{day} = 50{,}000 \times \left( \frac{8{,}000}{1{,}000{,}000} \times 3.00 + \frac{500}{1{,}000{,}000} \times 15.00 \right)\]

\[C_{day} = 50{,}000 \times (0.024 + 0.0075) = 50{,}000 \times 0.0315 = \$1{,}575/\text{day}\]

Annualised, that is approximately $575,000, a manageable line in an enterprise health system budget.
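A projection like this is easy to encode and stress-test. The sketch below reproduces it as a parameterised function; the names are ours, not from the original planning documents.

```python
# Sketch: the launch cost model with each assumption exposed as a parameter.
# Prices are Sonnet-class ($ per 1M tokens).

def daily_cost(queries_per_day: int = 50_000,
               input_tokens: int = 8_000,
               output_tokens: int = 500,
               price_in: float = 3.00,
               price_out: float = 15.00) -> float:
    per_query = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return queries_per_day * per_query

print(f"projected: ${daily_cost():,.0f}/day")  # $1,575/day, ≈ $575k/year
# Context inflation alone, at the original volume:
print(f"47k-token contexts: ${daily_cost(input_tokens=47_000):,.0f}/day")  # $7,425/day
```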
What the cost model did not account for:
First, the system used a multi-turn conversational interface rather than single-shot queries. Clinicians asked follow-up questions, and each follow-up prepended the entire prior conversation plus the original retrieved context. By turn four of an average conversation, input token counts had tripled.
Second, the retrieval layer used a top-k of 15 documents rather than the 5 in the cost model, because early evaluations showed that broader retrieval improved recommendation quality. Each document averaged 1,200 tokens rather than the estimated 550.
Third, the system included a system prompt of 3,500 tokens containing clinical disclaimers, model persona, and output formatting instructions that appeared in every single API call. This overhead was invisible in early testing and was never added to the cost model.
The actual average input token count per interaction, accounting for conversation turns, document retrieval, and system prompt, was approximately 47,000 tokens — nearly six times the projection. Combined with higher-than-anticipated query volume, the system ran at a daily cost of $16,400, an annual run rate of $6 million against a $575,000 budget.
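The realised query volume is not stated directly, but it is implied by the figures above:

\[\text{queries/day} \approx \frac{\$16{,}400}{47{,}000 \times \frac{3.00}{10^6} + 500 \times \frac{15.00}{10^6}} = \frac{16{,}400}{0.1485} \approx 110{,}000\]

roughly 2.2 times the planning assumption of 50,000.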
Why Input-Output Asymmetry Amplifies the Problem
The token direction — whether tokens are input or output — matters significantly for cost management. Most frontier models price output tokens at a 3x to 5x premium over input tokens. This is rational from the provider’s perspective: generating tokens requires sequential autoregressive computation, while processing input tokens benefits from parallel attention.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/Input Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4.0x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5.0x |
| Gemini 1.5 Pro | $1.25 | $5.00 | 4.0x |
| Mistral Large | $2.00 | $6.00 | 3.0x |
For RAG architectures, this asymmetry is typically favourable — you spend heavily on input (retrieved context) but generate relatively compact outputs. However, if your system prompt or retrieval strategy causes input token inflation, the 4-5x output premium is the least of your problems.
The dangerous edge case is when output tokens also expand, for example when a clinical system is configured to generate verbose explanations rather than structured summaries. A system that generates 2,500-token differential diagnoses at $15/1M output tokens, multiplied across 50,000 daily queries, incurs $1,875/day in output costs alone before accounting for any input tokens.
Structural Mitigations
The $6M overrun was recoverable because the architecture supported targeted optimisation. The interventions below brought costs back to within 20% of the original cost model.
Context pruning and sliding windows. Rather than appending the full prior conversation on each turn, the system was refactored to maintain a compressed conversation summary alongside only the most recent two turns in full. This reduced per-turn input token growth from linear to near-constant.
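A minimal sketch of the pattern, with a placeholder in place of the real summarisation call; the class and helper names are illustrative, not the production refactor:

```python
# Sketch: sliding-window context with a running summary. summarise() is a
# placeholder for a cheap LLM call; names are illustrative.

def summarise(text: str) -> str:
    """Placeholder compression; in production, a cheap model call."""
    return text[:2_000]

class SlidingWindowContext:
    """Keep the last `keep_last` turns verbatim and fold older turns into a
    running summary, so per-call input size stays near-constant."""

    def __init__(self, keep_last: int = 2):
        self.keep_last = keep_last
        self.summary = ""
        self.recent: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep_last:
            oldest = self.recent.pop(0)
            self.summary = summarise(self.summary + "\n" + oldest)

    def render(self) -> str:
        return (self.summary + "\n\n" + "\n".join(self.recent)).strip()
```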
Retrieval calibration. The top-k was reduced from 15 to 7, with documents truncated to 800 tokens using a sentence-boundary-aware truncation strategy. Retrieval quality, measured by clinician satisfaction scores, dropped by less than 3%.
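One way to implement the truncation, assuming a rough 4-characters-per-token heuristic in place of a real tokeniser:

```python
import re

CHARS_PER_TOKEN = 4  # rough heuristic; a real tokeniser would be exact

def truncate_at_sentence(text: str, max_tokens: int = 800) -> str:
    """Cut a document to roughly max_tokens without splitting a sentence."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(text) <= budget:
        return text
    sentences = re.split(r"(?<=[.!?])\s+", text)
    out, used = [], 0
    for sentence in sentences:
        if used + len(sentence) > budget:
            break
        out.append(sentence)
        used += len(sentence) + 1
    if not out:                 # single sentence longer than the budget
        return text[:budget]
    return " ".join(out)
```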
System prompt compression. The 3,500-token system prompt was audited and reduced to 1,100 tokens by eliminating redundant disclaimers and converting verbose instructions to structured JSON-style directives.
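A hypothetical before/after of that conversion, illustrating the technique rather than the actual clinical prompt:

```python
# Hypothetical illustration of the rewrite, not the actual clinical prompt.

VERBOSE = (
    "When you produce your answer, please always make sure to format the "
    "response as a bulleted list, and please also remember to include the "
    "relevant clinical guideline citation for every recommendation..."
)

# The same instructions as compact, structured directives:
COMPACT = {
    "output_format": "bulleted_list",
    "cite_guideline_per_recommendation": True,
}
```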
Model tier routing. Simple queries (drug dosage lookups, ICD code clarifications) were routed to a Haiku-class model at $0.25/1M input. Complex clinical synthesis was retained on Sonnet. This alone reduced the average per-query cost by 34%.
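A routing layer can be as simple as a gate in front of the model call; the sketch below uses an illustrative keyword heuristic, not the production classifier:

```python
# Sketch: route cheap lookup-style queries to a small model. The keyword
# patterns and model labels are illustrative assumptions.

SIMPLE_PATTERNS = ("dosage", "dose of", "icd code", "icd-10")

def pick_model(query: str) -> str:
    q = query.lower()
    if any(p in q for p in SIMPLE_PATTERNS):
        return "haiku-class"   # $0.25/1M input
    return "sonnet-class"      # $3.00/1M input

print(pick_model("What is the ICD code for type 2 diabetes?"))  # haiku-class
```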
Prompt caching. The static system prompt and a core set of clinical guidelines were placed in Anthropic’s server-side prompt cache, which charges a discounted rate for cache reads ($0.30/1M) versus full input pricing ($3.00/1M). With the system prompt appearing in every call, this yielded a 9% reduction in total input costs — meaningful at this query volume.
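In API terms, the change looks roughly like the sketch below, using Anthropic's cache_control annotation on static content blocks. The model id and placeholder strings are illustrative, and pricing tiers change over time.

```python
import anthropic

SYSTEM_PROMPT = "<1,100-token static clinical system prompt>"
CORE_GUIDELINES = "<static guideline excerpts reused on every call>"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model id
    max_tokens=1024,
    system=[
        # Static blocks marked cacheable: subsequent calls reusing this
        # exact prefix are billed at the discounted cache-read rate.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": CORE_GUIDELINES,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user",
               "content": "Any interactions between lisinopril and ibuprofen?"}],
)
print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```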
The Governance Imperative
The structural lesson from the healthcare overrun is not that RAG is expensive. It is that the unit economics of AI systems cannot be modelled from the price sheet alone. The actual cost depends on:
- The shape of your context windows across a realistic conversation distribution
- Your retrieval strategy and average document density
- The frequency and length of your system prompt
- The ratio of single-shot to multi-turn interactions
- Your model tier selection per query type
None of these variables appear on an API invoice. They only become visible when you instrument your system at the call level — logging token counts, tracking context growth per session, and attributing costs to the architectural decisions that drive them.
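A minimal version of that instrumentation is a wrapper that records token counts and attributed cost per call; the schema below is illustrative:

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative price table ($ per 1M tokens): (input, output).
PRICES = {"sonnet-class": (3.00, 15.00), "haiku-class": (0.25, 1.25)}

@dataclass
class CallRecord:
    session_id: str
    turn: int
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: float

def log_call(session_id: str, turn: int, model: str,
             input_tokens: int, output_tokens: int) -> CallRecord:
    """Attribute cost to the call that incurred it, at the call site."""
    p_in, p_out = PRICES[model]
    record = CallRecord(
        session_id=session_id, turn=turn, model=model,
        input_tokens=input_tokens, output_tokens=output_tokens,
        cost_usd=(input_tokens * p_in + output_tokens * p_out) / 1_000_000,
        timestamp=time.time(),
    )
    print(json.dumps(asdict(record)))  # ship to your metrics pipeline
    return record
```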
The organisations that invest in this instrumentation layer in 2026 will be able to defend their AI budgets to the CFO with the same precision they use to defend their AWS bills. The organisations that do not will continue discovering $6M surprises in their Q3 reconciliations.
Next in the series: The Infinite Spend Bug: Recursive Agent Loops and the New Economics of Agentic AI — how multi-step agents silently multiply API costs, why Anthropic moved to metered billing for agent subscriptions, and the circuit-breaker patterns that prevent runaway spend.
