A software bug that causes infinite recursion terminates with a stack overflow. A token bug — an agentic AI loop that recurses without a termination condition — terminates with a billing invoice. In 2026, as autonomous agents displaced simple chat completions as the primary AI interaction pattern, organisations discovered that the economics of agentic systems are fundamentally different from those of single-shot inference. Anthropic’s decision to move Claude agents onto metered billing across all subscription tiers was not a product update. It was a signal that the industry has reached an inflection point where agent economics require the same governance discipline as cloud infrastructure.
What Makes an Agent Loop Different from a Completion
A standard LLM completion is a bounded transaction: a prompt goes in, a response comes out, and the API call closes. Cost is predictable at the time of the request: the input token count is known, and the output is capped by the maximum generation length.
An agentic loop is structurally different. It is a control flow in which the model’s output determines the next action, which in turn generates new input, which feeds another model call. The loop terminates when the model decides it has completed its objective — or when an external circuit breaker intervenes. If neither condition is met cleanly, the loop continues.
The cost of a single agent task is therefore not the cost of one API call. It is the sum of all API calls across the entire task execution:
\[C_{agent} = \sum_{k=1}^{K} \left( T_{in}^{(k)} \cdot P_{in} + T_{out}^{(k)} \cdot P_{out} \right)\]

where \(K\) is the number of steps the agent takes to complete (or fail to complete) the task, \(T_{in}^{(k)}\) and \(T_{out}^{(k)}\) are the input and output tokens consumed at step \(k\), and \(P_{in}\) and \(P_{out}\) are the per-token prices. In a well-designed agent, \(K\) is bounded. In a poorly designed one, \(K\) is determined by the model’s own assessment of task completion, which can be confused, misdirected, or pathologically optimistic.
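To make the summation concrete, it can be computed directly from a per-step token log. The sketch below is a minimal, framework-agnostic illustration; the prices and the step counts are placeholder values, not actual list prices.

```python
from dataclasses import dataclass

# Placeholder per-token prices in dollars -- illustrative only, not real list prices.
PRICE_IN = 3.00 / 1_000_000    # dollars per input token
PRICE_OUT = 15.00 / 1_000_000  # dollars per output token


@dataclass
class Step:
    """Token usage recorded for one model call in the agent loop."""
    tokens_in: int
    tokens_out: int


def agent_cost(steps: list[Step], price_in: float = PRICE_IN,
               price_out: float = PRICE_OUT) -> float:
    """C_agent = sum over steps of (T_in * P_in + T_out * P_out)."""
    return sum(s.tokens_in * price_in + s.tokens_out * price_out for s in steps)


# A hypothetical six-step run: note how input tokens grow as context accumulates.
run = [Step(1_200, 300), Step(2_100, 450), Step(3_400, 500),
       Step(4_800, 620), Step(6_300, 700), Step(7_900, 1_100)]
print(f"Total run cost: ${agent_cost(run):.4f}")
```

Because each step re-sends the accumulated context, the input term typically grows with every iteration, which is why unbounded loops become expensive so quickly.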
The Anatomy of a Runaway Loop
Consider an agent tasked with “audit the security posture of this repository and generate a remediation plan.” The intended execution path might be:
- Read repository structure
- Identify configuration files
- Check each file against a security policy
- Generate findings
- Draft remediation plan
- Return result
In practice, an under-specified agent with access to search, read, and web-browsing tools may:
- Read repository structure (1 API call)
- Identify 47 configuration files (1 API call)
- Check file 1 — finds a reference to an external library, decides to research the library’s CVE history (3 API calls)
- CVE research yields references to related libraries — agent expands scope (8 more API calls)
- Each new library reference expands the research scope further
- After 200+ API calls spanning three hours of wall-clock time, the agent produces a report that references 340 libraries, most of which are irrelevant
This is not a theoretical failure mode. It is a documented pattern in every agentic framework that allows models to self-direct their tool use without explicit step budgets. The model is behaving rationally from its own perspective — gathering more information to produce a more comprehensive output. The operator never specified a limit, so the model imposed none.
Anthropic’s Response: Metered Agent Billing
Anthropic’s 2026 shift to metered agent billing across its subscription tiers reflects a recognition that flat-rate pricing for agentic workloads is structurally unsustainable, both for the operator running the agent and for Anthropic managing server capacity.
Under the new model, agent runs within Claude’s built-in tools (computer use, web search, file operations) are billed against a token meter that tracks both input and output across the entire agent session. Subscription tiers receive a monthly token allocation; overages are billed at standard API rates. This billing structure creates natural pressure toward efficient agent design: organisations that burn their allocation on runaway loops face real financial consequences, not just degraded performance.
The implications for teams building on the Claude API rather than the consumer product are equally significant. The metered model validates a set of engineering practices that had previously been regarded as optional:
| Practice | Pre-Metering Status | Post-Metering Status |
|---|---|---|
| Step count limits per agent run | Optional best practice | Financial necessity |
| Token budget per task type | Rarely implemented | Standard requirement |
| Agent run cost attribution | Difficult to instrument | Critical for billing accuracy |
| Circuit breakers on tool calls | Advanced implementations only | Baseline engineering requirement |
| Task scope specification in prompts | Output quality technique | Output quality and cost control |
The Three Failure Modes
Agentic cost overruns concentrate around three patterns:
1. Scope creep without bounds. The agent is given access to tools that allow it to expand its own task definition. Web search is the most common vector — a task that starts as “summarise this document” becomes “research all claims in this document” becomes “verify every cited source” becomes a multi-hour research engagement. Mitigation: define tool access narrowly, and include explicit scope boundaries in system prompts (“do not follow external links; work only from the provided document”).
2. Retry loops on tool failure. When a tool call fails — a web page returns a 404, an API is rate-limited, a file cannot be parsed — a poorly calibrated agent will retry. Without a maximum retry count, transient failures become infinite loops. Mitigation: implement explicit retry budgets (maximum 3 retries per tool call) and instruct the model to report failures rather than loop on them; a minimal wrapper implementing this budget is sketched after this list.
3. Verification spirals. Some models, when instructed to be thorough, enter a pattern of self-verification — generating a result, evaluating it, finding it inadequate, regenerating, re-evaluating. Each iteration is a full generation cycle. Mitigation: separate generation from evaluation into distinct agent steps with independent token budgets, and cap the number of evaluation passes.
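The retry-budget mitigation in the second failure mode can be enforced with a thin wrapper around tool calls, independent of which framework drives the loop. This is a minimal sketch under assumed interfaces: `tool_fn` and the structured failure format are hypothetical stand-ins, not any framework’s API.

```python
import time


def call_with_retry_budget(tool_fn, *args, max_retries: int = 3, backoff_s: float = 1.0):
    """Invoke a tool at most (1 + max_retries) times, then report the failure
    as data the model can reason about instead of looping on it."""
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return {"status": "ok", "result": tool_fn(*args)}
        except Exception as exc:  # in practice, catch the tool layer's specific error types
            last_error = exc
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    # Surfacing the failure as a structured observation lets the agent move on
    # (or report the problem) rather than burning its budget on a dead tool.
    return {"status": "failed", "error": str(last_error), "attempts": 1 + max_retries}
```

Returning the failure as an observation, rather than raising, is the point: the model sees an explicit signal that the tool is not going to work and can fall back or report, instead of re-issuing the same call.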
Designing Cost-Bounded Agents
The engineering pattern that prevents runaway loops is the explicit budget constraint, embedded at three levels:
Prompt-level constraints. The system prompt specifies maximum steps, maximum tool calls, and acceptable scope boundaries. Example: “Complete this task in no more than 10 steps. If you cannot complete it within 10 steps, return a partial result with an explanation of what remains.”
Framework-level guardrails. Agent orchestration frameworks (LangGraph, AutoGen, CrewAI) support maximum iteration counts and token budgets as first-class configuration. These should always be set explicitly rather than left at framework defaults, which are often permissive enough to allow long runaway runs.
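Exact configuration keys differ across these frameworks, so rather than reproduce any one API, the sketch below shows the loop such guardrails enforce: a hard step cap and a cumulative token budget checked on every iteration. All names are illustrative; `agent_step` stands in for whatever executes one model call plus its tool use.

```python
def run_bounded_agent(agent_step, task, max_iterations: int = 10, token_budget: int = 200_000):
    """Drive an agent loop that stops on completion, step cap, or token budget,
    whichever comes first. `agent_step` is assumed to return (result_or_None, tokens_used)."""
    tokens_spent = 0
    state = {"task": task, "history": []}
    for step in range(max_iterations):
        result, tokens_used = agent_step(state)
        tokens_spent += tokens_used
        if result is not None:                    # the model signalled completion
            return {"status": "complete", "result": result,
                    "steps": step + 1, "tokens": tokens_spent}
        if tokens_spent >= token_budget:          # budget exhausted mid-task
            return {"status": "budget_exceeded",
                    "steps": step + 1, "tokens": tokens_spent}
    return {"status": "step_limit_reached",
            "steps": max_iterations, "tokens": tokens_spent}
```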
Infrastructure-level circuit breakers. API gateway or proxy layers can enforce hard token limits per agent session, killing runs that exceed the budget regardless of model state. This is the safety net when prompt-level and framework-level constraints fail.
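At the infrastructure level, the same discipline can live in a gateway that keeps a per-session token ledger and refuses to forward requests once the budget is spent, regardless of what the agent believes about its own progress. A minimal in-memory sketch with hypothetical names; a production version would keep the ledger in shared storage so every gateway replica sees the same count.

```python
class SessionTokenBreaker:
    """Per-session token ledger enforced outside the agent's own control flow."""

    def __init__(self, hard_limit: int = 500_000):
        self.hard_limit = hard_limit
        self.spent: dict[str, int] = {}  # session_id -> cumulative tokens

    def allow(self, session_id: str, estimated_tokens: int) -> bool:
        """Check before forwarding a request; False means the run should be killed."""
        return self.spent.get(session_id, 0) + estimated_tokens <= self.hard_limit

    def record(self, session_id: str, tokens_used: int) -> None:
        """Record actual usage from the API response after each call."""
        self.spent[session_id] = self.spent.get(session_id, 0) + tokens_used


# Inside a hypothetical proxy handler:
# if not breaker.allow(session_id, estimated_tokens):
#     reject(session_id, reason="agent session token budget exhausted")
```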
The cost model for a well-governed agent run is:
\[C_{bounded} = \min\left( C_{agent},\ K_{max} \cdot \bar{C}_{step} \right)\]

where \(K_{max}\) is the maximum permitted step count and \(\bar{C}_{step}\) is the average cost per agent step, a number that can be calibrated empirically during development and used to set \(K_{max}\) against a per-task budget constraint.
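In practice, \(K_{max}\) falls out of a per-task budget and the measured average step cost. A one-line calibration, using assumed numbers purely for illustration:

```python
# Calibrate the step cap from a per-task budget and a measured average step cost.
# Both figures are assumptions for illustration, not measured values.
avg_step_cost = 0.042   # dollars per agent step, averaged over development runs
task_budget = 0.50      # dollars this task type is allowed to spend

k_max = int(task_budget // avg_step_cost)
print(k_max)  # 11 steps for these example numbers
```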
What the Metered Shift Means for Architecture
Anthropic’s move to metered agent billing is not an isolated vendor decision. It reflects a broader industry direction. As agents become the primary interface through which organisations interact with AI — replacing both chat UIs and batch pipelines — the cost per unit of value delivered will increasingly be measured in agent steps rather than tokens.
This changes the economic calculus for every AI architecture decision. The question is no longer only “which model produces the best output?” It becomes “which model produces sufficient output at the lowest step count?” A model that completes a task in four steps is economically superior to one that needs twelve to reach the same outcome, even if the four-step model’s per-token price is higher.
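The arithmetic is worth spelling out. The numbers below are invented purely to show the structure of the comparison: a model with a higher per-token price can still win on total cost if it finishes in fewer steps.

```python
# Invented prices and token counts -- only the shape of the calculation matters.
def task_cost(steps: int, tokens_per_step: int, price_per_token: float) -> float:
    """Total cost of a run, assuming roughly constant tokens per step."""
    return steps * tokens_per_step * price_per_token

four_step = task_cost(steps=4, tokens_per_step=6_000, price_per_token=10e-6)    # $0.240
twelve_step = task_cost(steps=12, tokens_per_step=6_000, price_per_token=4e-6)  # $0.288
print(four_step < twelve_step)  # True: fewer steps beat the cheaper per-token rate here
```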
The organisations that tune their agents for step efficiency now will have a durable cost advantage as the agentic paradigm matures. Those that optimise only for output quality, without constraining step count, will find that the meter runs whether or not the agent produces useful work.
Next in the series: Beyond the Token: Google’s Per-Minute Pricing and What It Means for the Economics of Real-Time AI — how Gemini Live API’s $0.005/minute rate is disrupting the token-per-call pricing model, and when it makes economic sense to pay for time rather than tokens.
