<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://goldmanmalka.com/https://goldmanmalka.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://goldmanmalka.com/https://goldmanmalka.com/" rel="alternate" type="text/html" /><updated>2026-06-12T15:50:27+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/feed.xml</id><title type="html">Goldmanmalka.com</title><subtitle>An online notepad</subtitle><author><name>Eran Goldman-Malka</name></author><entry><title type="html">Are Users Compliant? Enterprise AI, Scholarly Retrieval, and the Obligations You Cannot Outsource</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/ai-users-scholarly-compliance/" rel="alternate" type="text/html" title="Are Users Compliant? Enterprise AI, Scholarly Retrieval, and the Obligations You Cannot Outsource" /><published>2026-06-12T07:00:00+00:00</published><updated>2026-06-12T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/ai-users-scholarly-compliance</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/ai-users-scholarly-compliance/"><![CDATA[<p>Frictionless AI retrieval creates a dangerous illusion: if Unpaywall found a legal open-access copy and Claude summarized it, the workflow must be compliant. It may not be. Organizations using Claude or similar tools for scholarly workflows remain responsible for lawful access, license compliance, data protection, and platform terms—even when OA discovery tools and AI vendors make retrieval feel automatic.</p>

<!--more-->

<h2 id="you-cannot-outsource-copyright-to-the-model">You Cannot Outsource Copyright to the Model</h2>

<p><strong>Fact:</strong> Anthropic’s Commercial Terms require customers to comply with applicable laws and the Usage Policy (<a href="https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance">legal and compliance</a>). Customers—not the model—bear responsibility for how outputs are used, including human review where appropriate in high-risk domains (<a href="https://www.anthropic.com/news/usage-policy-update">usage policy update</a>).</p>

<p><strong>Interpretation:</strong> Vendor terms cover one relationship, not both. Anthropic governs your use of Claude; you govern your use of third-party scholarly content inside Claude. Passing a PDF to an API does not transfer copyright clearance any more than photocopying transfers ownership.</p>

<h2 id="the-enterprise-compliance-stack">The Enterprise Compliance Stack</h2>

<p>Deployers of AI scholarly workflows should maintain controls across six domains:</p>

<table>
  <thead>
    <tr>
      <th>Domain</th>
      <th>Question</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Platform tier</td>
      <td>Commercial/API vs consumer?</td>
    </tr>
    <tr>
      <td>Copyright &amp; license</td>
      <td>OA status, license, intended reuse</td>
    </tr>
    <tr>
      <td>Contractual</td>
      <td>Library licenses, publisher ToS, Unpaywall API terms</td>
    </tr>
    <tr>
      <td>Data protection</td>
      <td>GDPR, HIPAA, confidential research data</td>
    </tr>
    <tr>
      <td>Provenance</td>
      <td>Can you reconstruct what was retrieved and why?</td>
    </tr>
    <tr>
      <td>Governance</td>
      <td>Who approves high-risk literature use cases?</td>
    </tr>
  </tbody>
</table>

<p>A gap in any row can produce audit findings, even when every retrieved article was “open access.”</p>

<h2 id="commercial-versus-consumer-tier-traps">Commercial Versus Consumer Tier Traps</h2>

<p><strong>Fact:</strong> Anthropic’s consumer terms (Free, Pro, Max) differ from Commercial Terms on data retention and whether inputs may be used to improve models, subject to user settings (<a href="https://www.anthropic.com/news/updates-to-our-consumer-terms">consumer terms update</a>). Commercial API and Enterprise paths offer stronger defaults against training on customer content.</p>

<p><strong>Fact:</strong> Zero Data Retention (ZDR) arrangements for the API limit how long customer data is stored at rest (<a href="https://docs.anthropic.com/en/docs/build-with-claude/zero-data-retention">ZDR documentation</a>). ZDR is an organizational setting—not automatic on all tiers.</p>

<p><strong>Risk:</strong> When employees paste paywalled or confidential PDFs into consumer Claude for “quick summaries,” they bypass the enterprise DPA, retention controls, and license restrictions all at once. Shadow AI is a scholarly-compliance problem, not only a security problem.</p>

<p><strong>Control:</strong> Mandate Commercial or API access for work-related scholarly workflows; block or monitor consumer endpoints where feasible; train researchers that OA discovery does not equal upload permission.</p>

<h2 id="confidentiality-gdpr-and-scholarly-pdfs">Confidentiality, GDPR, and Scholarly PDFs</h2>

<p>Scholarly articles are not automatically non-personal data. Clinical trials, case reports, genetics studies, and social-science fieldwork may contain personal data subject to GDPR (<em>jurisdiction-specific</em>) or protected health information under HIPAA in the United States.</p>

<p><strong>Fact:</strong> Anthropic offers HIPAA-ready API access with a Business Associate Agreement when ZDR is activated for the organization (<a href="https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance">legal and compliance — BAA section</a>).</p>

<p><strong>Risk analysis:</strong> Uploading an oncology PDF that contains patient-level details to a non-HIPAA, non-ZDR endpoint may constitute unauthorized processing, even if the article is OA. OA addresses copyright access, not data-protection classification.</p>

<p>Deployers should:</p>

<ul>
  <li>Classify uploads before they reach the model</li>
  <li>Complete DPIAs where systematic literature RAG processes external PDFs</li>
  <li>Document lawful basis and transfer mechanisms for EU operations</li>
  <li>Segregate identifiable human-subject data from general literature pipelines</li>
</ul>
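<p>The first item, classifying uploads before they reach the model, can be sketched as a small gate. This is a minimal illustration under stated assumptions, not a recommended policy: the <code class="language-plaintext highlighter-rouge">Endpoint</code> fields, the three data classes, and the rule that personal data requires ZDR are all hypothetical choices made for the example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of a model endpoint's contractual posture."""
    commercial_terms: bool
    zdr_enabled: bool
    baa_signed: bool


def upload_allowed(data_class: str, endpoint: Endpoint) -> bool:
    """Gate an upload by data class: 'general', 'personal', or 'phi'.

    The rules below are an assumed example policy, not legal guidance.
    """
    if not endpoint.commercial_terms:
        return False  # consumer tiers are out of scope for work uploads
    if data_class == "general":
        return True
    if data_class == "personal":
        return endpoint.zdr_enabled
    if data_class == "phi":
        return endpoint.zdr_enabled and endpoint.baa_signed
    return False  # unknown classes fail closed
```

<p>Failing closed on unknown classes means a document that was never classified cannot reach the model by default.</p>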

<h2 id="library-licenses-versus-open-access">Library Licenses Versus Open Access</h2>

<p><strong>Fact:</strong> Institutions often subscribe to journals under licenses that restrict systematic download, text mining without addendum, or sharing with third-party processors—including AI vendors.</p>

<p><strong>Interpretation:</strong> Unpaywall finding an OA copy does not override a separate subscription agreement for the toll-access version an employee uploaded from a desktop. Conversely, absence of OA does not authorize uploading a subscription PDF to Claude if the license prohibits it.</p>

<p>Library and research offices should publish clear guidance: which retrieval paths are approved (OA resolver, interlibrary loan, licensed platform APIs) and which are not (paste into consumer chat).</p>

<h2 id="provenance-and-audit-trails">Provenance and Audit Trails</h2>

<p>Regulators and enterprise auditors increasingly ask not <em>what did the model say</em> but <em>what did it read</em>.</p>

<p>Minimum provenance log per retrieval event:</p>

<ul>
  <li>User or service identity</li>
  <li>DOI and Unpaywall <code class="language-plaintext highlighter-rouge">oa_status</code></li>
  <li>License string and <code class="language-plaintext highlighter-rouge">host_type</code></li>
  <li>Source URL fetched</li>
  <li>Timestamp and workflow purpose</li>
  <li>Model tier and retention flag (ZDR yes/no)</li>
  <li>Output disposition (ephemeral vs stored in RAG)</li>
</ul>
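<p>The fields listed above map naturally onto one structured record per retrieval. The following is a minimal sketch, assuming JSON lines as the log format; the class name, field names, and tier labels are illustrative and not taken from any vendor schema.</p>

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievalEvent:
    """One provenance record per retrieval; field names are illustrative."""
    actor: str               # user or service identity
    doi: str
    oa_status: str           # Unpaywall oa_status, e.g. "gold" or "green"
    license: str             # license string as reported, or "unknown"
    host_type: str           # Unpaywall host_type: "publisher" or "repository"
    source_url: str          # URL actually fetched
    purpose: str             # workflow purpose
    model_tier: str          # e.g. "commercial-api" (hypothetical label)
    zdr_enabled: bool        # retention flag
    output_disposition: str  # "ephemeral" or "stored_in_rag"
    timestamp: str = ""      # filled in at log time if left empty


def to_log_line(event: RetrievalEvent) -> str:
    """Serialize an event as one JSON line, stamping UTC time if missing."""
    record = asdict(event)
    if not record["timestamp"]:
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(record, sort_keys=True)
```

<p>Writing one line per event to append-only storage keeps the record easy to query later by DOI or user, which is what an investigation typically needs.</p>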

<p>For organizations deploying systems that may qualify as high-risk under the EU AI Act, transparency obligations toward deployers appear in Article 13 of Regulation (EU) 2024/1689 (<a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689">EUR-Lex</a>, <a href="https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-13">AI Act Service Desk Art. 13</a>). Applicability depends on system classification—another jurisdiction-specific analysis requiring counsel.</p>

<p>Anthropic’s <a href="https://www.anthropic.com/transparency/system-trust-reporting">Transparency Hub</a> documents platform-side enforcement and legal process handling; it does not replace deployer-side logging of scholarly sources.</p>

<h2 id="incident-scenarios">Incident Scenarios</h2>

<p><strong>NC-licensed corpus in a commercial product.</strong> An agent ingests CC BY-NC articles discovered via Unpaywall into a revenue-generating RAG application. The NC term prohibits commercial reuse, so this is a license violation even though access was lawful.</p>

<p><strong>Confidential proposal plus third-party PDFs.</strong> A researcher uploads a draft grant containing unpublished results and attaches licensed papers. Data leak plus potential copyright breach.</p>

<p><strong>RAG index retains paywalled content.</strong> Scraper misclassifies a toll-access PDF as OA; embeddings persist after takedown. Provenance failure turns a fetch error into a sustained compliance debt.</p>

<p><strong>Consumer-tier clinical summarization.</strong> Hospital team uses Pro-tier Claude on OA cancer literature containing trial participant details. Potential HIPAA/GDPR exposure independent of OA status.</p>

<h3 id="real-world-example">Real-World Example</h3>

<p>A hospital research team uses Claude to summarize recent OA cancer literature. Trial reports include demographic tables with small cell counts, meaning few participants per table cell, which raises re-identification risk. Staff use consumer-tier paste because “the papers are open access.” <strong>Copyright access</strong> may be fine; <strong>PHI handling</strong> is not. <strong>Tier selection</strong> failed. <strong>Provenance</strong> is absent: the compliance team cannot reconstruct which articles entered which sessions during an investigation.</p>

<h2 id="practical-checklist-users-and-deployers">Practical Checklist: Users and Deployers</h2>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Mandate Commercial/API tier for work-related scholarly workflows</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Activate ZDR (and BAA if PHI) where required</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Block or scan uploads of subscription-licensed PDFs without explicit rights</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Maintain retrieval provenance logs with DOI, license, URL, tier</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Train staff: OA discovery ≠ upload permission</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Complete DPIA for RAG over external scholarly content</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Maintain incident playbooks for wrongful ingestion or shadow AI use</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Align literature agents with library licensing guidance</li>
</ul>

<h2 id="risks-and-counterarguments">Risks and Counterarguments</h2>

<p><strong>“Anthropic’s terms protect us.”</strong> They protect the customer–vendor relationship, not customer–publisher relationships.</p>

<p><strong>“We only use OA papers.”</strong> OA does not eliminate NC restrictions, the unclear reuse terms of bronze OA, PHI in articles, or API tier requirements.</p>

<p><strong>“Provenance is engineering overhead.”</strong> It is cheaper than post-incident reconstruction under regulatory inquiry.</p>

<h2 id="conclusion">Conclusion</h2>

<p>User compliance is operational, not theoretical. The tools make retrieval easy; governance makes it defensible. Publishers and rights holders face a mirror-image problem: legal OA increases readership while AI-scale aggregation erodes control they once exercised through access restrictions alone.</p>

<hr />

<p><strong>Enterprise AI governance for research workflows</strong>—tiering, logging, DPIA support. <a href="https://goldmanmalka.com/about">Contact me</a>.</p>

<hr />

<h3 id="relevant-sources">Relevant Sources</h3>

<ol>
  <li><strong>Legal and compliance — Claude Code</strong> — Anthropic — <a href="https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance">https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance</a></li>
  <li><strong>Zero Data Retention</strong> — Anthropic — <a href="https://docs.anthropic.com/en/docs/build-with-claude/zero-data-retention">https://docs.anthropic.com/en/docs/build-with-claude/zero-data-retention</a></li>
  <li><strong>Updates to Consumer Terms</strong> — Anthropic — <a href="https://www.anthropic.com/news/updates-to-our-consumer-terms">https://www.anthropic.com/news/updates-to-our-consumer-terms</a></li>
  <li><strong>Usage Policy Update</strong> — Anthropic — <a href="https://www.anthropic.com/news/usage-policy-update">https://www.anthropic.com/news/usage-policy-update</a></li>
  <li><strong>Transparency Hub</strong> — Anthropic — <a href="https://www.anthropic.com/transparency/system-trust-reporting">https://www.anthropic.com/transparency/system-trust-reporting</a></li>
  <li><strong>Regulation (EU) 2024/1689 (AI Act)</strong> — EUR-Lex — <a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689">https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689</a></li>
  <li><strong>Article 13 — AI Act</strong> — EU AI Act Service Desk — <a href="https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-13">https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-13</a></li>
  <li><strong>Negotiating Your Contract</strong> — Georgetown University Library — <a href="https://library.georgetown.edu/scholarly-communication/authors-rights-negotiate-your-contract">https://library.georgetown.edu/scholarly-communication/authors-rights-negotiate-your-contract</a></li>
</ol>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Compliance" /><category term="open access" /><category term="Claude" /><category term="AI compliance" /><category term="GDPR" /><category term="AI governance" /><category term="scholarly publishing" /><category term="provenance" /><category term="EU AI Act" /><summary type="html"><![CDATA[Frictionless AI retrieval creates a dangerous illusion: if Unpaywall found a legal open-access copy and Claude summarized it, the workflow must be compliant. It may not be. Organizations using Claude or similar tools for scholarly workflows remain responsible for lawful access, license compliance, data protection, and platform terms—even when OA discovery tools and AI vendors make retrieval feel automatic.]]></summary></entry><entry><title type="html">Are Creators Actually Paid? Open Access, APCs, and the AI Economics Gap</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/oa-creators-paid/" rel="alternate" type="text/html" title="Are Creators Actually Paid? Open Access, APCs, and the AI Economics Gap" /><published>2026-06-10T07:00:00+00:00</published><updated>2026-06-10T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/oa-creators-paid</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/oa-creators-paid/"><![CDATA[<p>Open access solved a reader-access problem: paywalls no longer block qualified researchers from reading scholarship. It did not automatically solve a creator-payment problem—and AI retrieval at scale may shift economic value toward intermediaries unless licensing and funding models explicitly account for new uses. When AI systems ingest OA literature, do creators and publishers actually get paid?</p>

<!--more-->

<h2 id="open-access-is-not-one-business-model">Open Access Is Not One Business Model</h2>

<p><strong>Fact:</strong> SPARC defines open access as “the free, immediate, online availability of research articles coupled with the rights to use these articles fully in the digital environment” (<a href="https://sparcopen.org/open-access/">SPARC Open Access</a>). That definition intentionally pairs access with reuse rights—it is not synonymous with “free to read behind a registration wall.”</p>

<p><strong>Interpretation:</strong> “Open” describes a publishing and rights architecture, not a single revenue model. Payment can flow through article processing charges (APCs), institutional agreements, funder mandates, society subsidies, or unfunded green archiving. AI-mediated discovery and summarization add a new value-capture layer that most legacy OA economics did not anticipate.</p>

<h2 id="who-pays-whom-a-payment-model-map">Who Pays Whom: A Payment Model Map</h2>

<table>
  <thead>
    <tr>
      <th>OA model</th>
      <th>Who typically pays</th>
      <th>Creator payment signal</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gold OA</td>
      <td>Author, institution, or funder (APC)</td>
      <td>Publisher receives APC; author may bear cost</td>
    </tr>
    <tr>
      <td>Hybrid OA</td>
      <td>APC for OA articles in subscription journal</td>
      <td>Publisher paid per OA article; journal remains hybrid</td>
    </tr>
    <tr>
      <td>Green OA</td>
      <td>Often no per-read payment</td>
      <td>Author retains some rights via repository deposit</td>
    </tr>
    <tr>
      <td>Bronze OA</td>
      <td>Reader pays nothing</td>
      <td>Publisher may monetize attention; no clear reuse license</td>
    </tr>
  </tbody>
</table>

<p>Unpaywall’s <code class="language-plaintext highlighter-rouge">oa_status</code> field encodes this taxonomy (<a href="https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-">oa_status definitions</a>). AI builders who treat all <code class="language-plaintext highlighter-rouge">is_oa: true</code> records as economically equivalent misunderstand what creators actually received when they published.</p>
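<p>One way to avoid treating every <code class="language-plaintext highlighter-rouge">is_oa: true</code> record alike is to branch on status and license before reuse. The sketch below assumes a record shaped like an Unpaywall API response (<code class="language-plaintext highlighter-rouge">is_oa</code>, <code class="language-plaintext highlighter-rouge">oa_status</code>, and <code class="language-plaintext highlighter-rouge">best_oa_location.license</code>); the function name and the allow/review/deny policy are hypothetical and would need review by counsel before real use.</p>

```python
from typing import Optional

# Assumed policy: license identifiers with this prefix are non-commercial.
NC_MARKERS = ("cc-by-nc",)


def reuse_decision(record: dict, commercial_use: bool) -> str:
    """Return 'allow', 'review', or 'deny' for one Unpaywall-style record."""
    if not record.get("is_oa"):
        return "deny"
    location = record.get("best_oa_location") or {}
    license_id: Optional[str] = location.get("license")
    if record.get("oa_status") == "bronze" or not license_id:
        return "review"  # free to read, but no clear reuse license
    if commercial_use and license_id.startswith(NC_MARKERS):
        return "deny"    # NC terms conflict with commercial reuse
    if license_id == "cc-by":
        return "allow"   # attribution is still required downstream
    return "review"      # anything else goes to a human
```

<p>The decision covers permission only. It says nothing about whether creators were paid, which is the separate question this post raises.</p>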

<p>SPARC’s factsheet notes that CC BY is the standard license aligned with full OA reuse (<a href="https://sparcopen.org/wp-content/uploads/2016/01/SPARC-Open-Access-Factsheet.pdf">SPARC factsheet PDF</a>)—but CC BY governs <strong>permission</strong>, not <strong>payment</strong>.</p>

<h2 id="author-rights-versus-publisher-rights">Author Rights Versus Publisher Rights</h2>

<p>Many authors transfer copyright to publishers under standard publication agreements, then receive back specific rights—or publish under a Creative Commons license on the OA copy. The <a href="https://sparcopen.org/our-work/author-rights/">SPARC Author Addendum</a> is a legal instrument designed to help authors retain distribution, repository, and derivative-work rights that default contracts would otherwise withhold (<a href="https://sparcopen.org/our-work/author-rights/sparc-author-addendum-text/">addendum text</a>).</p>

<p><strong>Interpretation:</strong> “Open” for readers does not mean “paid fairly” for authors. An author may pay a $3,000 APC, publish under CC BY, and still receive zero revenue when a commercial AI product ingests the paper at scale—while remaining license-compliant.</p>

<p>Georgetown University Library’s guidance on negotiating contracts illustrates how publisher-retained rights often restrict which versions authors may share and where (<a href="https://library.georgetown.edu/scholarly-communication/authors-rights-negotiate-your-contract">negotiating your contract</a>). AI adds pressure because new uses (embedding, summarization markets, training) may fall outside rights authors thought they retained.</p>

<h2 id="where-ai-captures-value-in-the-chain">Where AI Captures Value in the Chain</h2>

<p>A typical AI-mediated literature workflow looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unpaywall discovery → HTTP fetch → parse/chunk → embed or summarize → product output
</code></pre></div></div>

<p>Value accrues at the summarization and product layers—often to the AI vendor or the deployer’s application—not to the author or publisher. Creators may receive citation uplift without royalty participation.</p>

<p>The U.S. Copyright Office’s pre-publication Part 3 report on generative AI training (<a href="https://copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf">Part 3 PDF</a>) examines whether training on copyrighted works requires permission or compensation. <strong>Fact:</strong> The Office concludes that training may implicate exclusive rights and that fair use must be assessed case-by-case—not categorically. <strong>Fact:</strong> The report emphasizes that existing or feasible licensing markets weigh against fair use under the fourth factor.</p>

<p><strong>Interpretation:</strong> Even if <em>retrieval for summarization</em> and <em>training</em> are legally distinct activities, they share an economic theme: creators may lack payment when AI systems commercialize their scholarship.</p>

<h2 id="cc-by-and-commercial-ai-permission-without-payment">CC BY and Commercial AI: Permission Without Payment</h2>

<p><strong>Fact:</strong> Creative Commons Attribution 4.0 (CC BY) permits sharing and adaptation for any purpose, including commercially, provided you give appropriate credit, link to the license, and indicate changes (<a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>).</p>

<p><strong>Fact:</strong> CC BY does not require payment to authors.</p>

<p><strong>Risk analysis:</strong> A commercial literature-RAG vendor that ingests CC BY corpora discovered via Unpaywall may be <strong>license-compliant</strong> while <strong>economically contentious</strong>. Authors and funders who paid APCs expecting public benefit may not have expected uncompensated commercial aggregation.</p>

<p>Authors who intend to restrict commercial AI reuse should consider licenses such as CC BY-NC—but funders and journals increasingly mandate CC BY for maximum dissemination (<a href="https://katinamagazine.org/content/article/open-knowledge/2025/route-to-open-scholarly-ecosystem-through-cc-by">Katina Magazine on CC BY</a>). That tension is policy, not a technical bug in Unpaywall.</p>

<h2 id="jurisdiction-specific-notes-uk-and-eu">Jurisdiction-Specific Notes: UK and EU</h2>

<p><strong>United Kingdom:</strong> As of late 2025, the UK government had not adopted a broad commercial text-and-data-mining exception; official statements emphasize that copyright material generally cannot be used for AI development without permission (<a href="https://www.bbc.com/news/articles/cvg1gr5v333o">BBC reporting on government position</a>). The existing UK TDM exception is limited to non-commercial research—relevant to training, not merely reading OA PDFs.</p>

<p><strong>European Union:</strong> The CDSM Directive provides TDM exceptions with opt-out mechanisms for rightholders—primarily relevant to copying for mining/training, distinct from accessing an OA PDF a human could read. Deployers operating across borders need counsel on which regime applies.</p>

<p>These points are <strong>jurisdiction-specific</strong> and evolve quickly; do not treat this post as a current law memo.</p>

<h3 id="real-world-example">Real-World Example</h3>

<p>A professor pays a $3,000 APC to publish an article open access under CC BY in a hybrid journal. A commercial AI research tool ingests the PDF via Unpaywall, indexes millions of similar articles, and sells synthesized literature briefs to investors. <strong>Access</strong> is lawful. <strong>Many reuse forms</strong> are CC BY-compliant if attribution is correct. The author receives citations but no AI royalty. The funder’s compliance office asks whether the APC purchased only reader access or also permission for unchecked third-party commercialization. That question has no universal answer in today’s licensing market.</p>

<h2 id="what-creators-should-verify">What Creators Should Verify</h2>

<p>Before publishing OA—or when reviewing an existing corpus exposed to AI retrieval—creators and research offices should verify:</p>

<ol>
  <li><strong>OA type and license</strong> on the actual copy Unpaywall would index (gold/green/hybrid/bronze)</li>
  <li><strong>Funder policy</strong> (Plan S, NIH, Wellcome, etc.) via tools such as the JISC Open Policy Finder referenced in <a href="https://www.openaccess.nl/en/publishing/copyright-and-open-licenses">openaccess.nl guidance</a></li>
  <li><strong>Whether CC BY-NC better matches non-commercial intent</strong>—and whether funders permit it</li>
  <li><strong>Publisher reservations</strong> on text and data mining where applicable</li>
  <li><strong>Contract addenda</strong> (SPARC or institutional) that clarify repository and derivative rights</li>
</ol>

<h2 id="practical-checklist-side-b-creators">Practical Checklist: Side B (Creators)</h2>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Know your OA type and license on the indexed copy</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Check funder OA and licensing requirements before accepting publisher terms</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Evaluate whether CC BY-NC fits your intent if commercial AI ingestion is a concern</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Track publisher AI and data-mining policy separately from OA reader access</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Use SPARC or institutional addenda where negotiation is possible</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Document APC funding source and what rights you believed you were buying</li>
</ul>

<h2 id="risks-and-counterarguments">Risks and Counterarguments</h2>

<p><strong>“OA means authors chose to give everything away.”</strong> OA chooses a rights architecture—often CC BY—not necessarily uncompensated commercial exploitation by AI intermediaries.</p>

<p><strong>“AI summarization helps authors through visibility.”</strong> Citation uplift is real but uneven; it is not a substitute for payment where markets might otherwise exist.</p>

<p><strong>“Funders require CC BY, so authors have no choice.”</strong> Policy tension is genuine; advocacy and licensing markets are the venue—not covert paywall bypassing.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Payment and permission diverge in the AI era. Legal OA clears many access barriers; it does not automatically align creator economics with new AI value chains. Organizations that deploy AI tools face their own parallel obligation: staying compliant when they consume scholarly content, regardless of how frictionless retrieval has become.</p>

<hr />

<p><strong>Publishing teams and research offices:</strong> I help map OA licenses to AI use cases and funder compliance. <a href="https://goldmanmalka.com/about">Get in touch</a>.</p>

<hr />

<h3 id="relevant-sources">Relevant Sources</h3>

<ol>
  <li><strong>Open Access</strong> — SPARC — <a href="https://sparcopen.org/open-access/">https://sparcopen.org/open-access/</a></li>
  <li><strong>SPARC Open Access Factsheet (PDF)</strong> — SPARC — <a href="https://sparcopen.org/wp-content/uploads/2016/01/SPARC-Open-Access-Factsheet.pdf">https://sparcopen.org/wp-content/uploads/2016/01/SPARC-Open-Access-Factsheet.pdf</a></li>
  <li><strong>Author Rights &amp; SPARC Addendum</strong> — SPARC — <a href="https://sparcopen.org/our-work/author-rights/">https://sparcopen.org/our-work/author-rights/</a></li>
  <li><strong>CC BY 4.0 Deed</strong> — Creative Commons — <a href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a></li>
  <li><strong>Copyright and AI Part 3 (Training)</strong> — U.S. Copyright Office — <a href="https://copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf">https://copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf</a></li>
  <li><strong>Copyright and AI hub</strong> — U.S. Copyright Office — <a href="https://copyright.gov/ai/">https://copyright.gov/ai/</a></li>
  <li><strong>Copyright and Open Licenses</strong> — openaccess.nl — <a href="https://www.openaccess.nl/en/publishing/copyright-and-open-licenses">https://www.openaccess.nl/en/publishing/copyright-and-open-licenses</a></li>
  <li><strong>OA status definitions</strong> — Unpaywall Support — <a href="https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-">https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-</a></li>
</ol>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Compliance" /><category term="open access" /><category term="copyright" /><category term="scholarly publishing" /><category term="AI governance" /><category term="author rights" /><category term="APC" /><category term="Creative Commons" /><category term="AI compliance" /><summary type="html"><![CDATA[Open access solved a reader-access problem: paywalls no longer block qualified researchers from reading scholarship. It did not automatically solve a creator-payment problem—and AI retrieval at scale may shift economic value toward intermediaries unless licensing and funding models explicitly account for new uses. When AI systems ingest OA literature, do creators and publishers actually get paid?]]></summary></entry><entry><title type="html">Where Is the Compliance Boundary? Open Access, Circumvention, Scraping, and Terms of Service</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/oa-compliance-boundary/" rel="alternate" type="text/html" title="Where Is the Compliance Boundary? Open Access, Circumvention, Scraping, and Terms of Service" /><published>2026-06-05T07:00:00+00:00</published><updated>2026-06-05T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/oa-compliance-boundary</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/oa-compliance-boundary/"><![CDATA[<p>The compliance line in AI-assisted scholarly retrieval is not drawn at “open versus closed.” It runs between locating publisher-authorized open-access copies and any practice that bypasses technical or contractual access controls, exceeds license scope, or violates site or API terms—even when an AI system could technically retrieve the bytes. Here we map where lawful OA ends and circumvention, scraping, and policy violations begin.</p>

<!--more-->

<h2 id="a-four-ring-boundary-model">A Four-Ring Boundary Model</h2>

<p>Compliance teams benefit from a layered model rather than a binary “legal/illegal” flag. Think of four concentric rings:</p>

<p><strong>Ring 1 — Discovery services.</strong> Unpaywall API, library link resolvers, discovery layers (EBSCO, Primo, WorldCat). Lowest friction when used as documented. Contractual obligations here are primarily API terms and rate limits (<a href="https://unpaywall.org/products/api">REST API</a>).</p>

<p><strong>Ring 2 — Lawful access endpoints.</strong> Publisher OA pages, institutional repositories, preprint servers, subject archives. The bytes live here. Access may be free to humans; automated access may still be constrained.</p>

<p><strong>Ring 3 — Permitted use.</strong> License terms (CC BY, CC BY-NC, publisher custom), embargo rules on green OA, citation requirements. This ring governs what you may do after lawful access.</p>

<p><strong>Ring 4 — Prohibited or high-risk conduct.</strong> Toll-access content without authorization, credential sharing, DRM or technical protection measure (TPM) circumvention, ignoring rightholder opt-outs, training commercial models on NC-licensed corpora without permission.</p>

<p>An AI pipeline can pass Ring 1 and Ring 2 while failing Ring 3 or Ring 4. That is the compliance boundary this post addresses.</p>

<h2 id="what-is-not-circumvention-when-done-correctly">What Is Not Circumvention (When Done Correctly)</h2>

<p><strong>Fact:</strong> The Unpaywall browser extension invites users to “skip the paywall on millions of peer-reviewed journal articles” by clicking when a legal free copy exists (<a href="https://unpaywall.org/products/extension">extension</a>). The marketing language reflects user experience on toll-access landing pages—not unauthorized access to content publishers have withheld.</p>

<p><strong>Fact:</strong> Library link-resolver integrations use Unpaywall data to offer OA copies when the institution lacks a subscription (<a href="https://support.unpaywall.org/support/solutions/articles/44001874811-link-resolver-integrations">link resolver integrations</a>). When no OA location exists, redirecting to the toll-access publisher page is described as correct behavior.</p>

<p><strong>Interpretation:</strong> These patterns are structurally different from services that obtain content through stolen credentials, exploit kits, or infringing mirrors. Unpaywall is not comparable to those tools—and compliance discussions should not collapse them into one category.</p>

<h2 id="what-is-circumvention-or-high-risk-analog">What Is Circumvention or a High-Risk Analog</h2>

<p>The following patterns sit outside lawful OA discovery even when Unpaywall or similar tools appear somewhere in the toolchain:</p>

<ul>
  <li><strong>Credential laundering:</strong> Using Unpaywall to identify an article, then fetching the toll-access version via shared institutional proxy cookies or pirated credentials.</li>
  <li><strong>Status ignoring:</strong> Disregarding <code class="language-plaintext highlighter-rouge">closed</code> or missing OA signals and brute-forcing alternate URLs, leaked copies, or third-party infringing hosts.</li>
  <li><strong>TPM breaking:</strong> Circumventing digital rights management or access controls on publisher platforms (<em>jurisdiction-specific</em>: U.S. DMCA §1201 and EU copyright rules on technological protection measures may apply; this is not legal advice).</li>
</ul>
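
<p>As a minimal sketch of a guard against the “status ignoring” pattern, the function below returns a URL only when the record itself signals an OA copy, and otherwise returns nothing rather than searching for alternatives. It assumes <code class="language-plaintext highlighter-rouge">record</code> is an already-parsed Unpaywall response; the function name, the <code class="language-plaintext highlighter-rouge">url</code> key inside <code class="language-plaintext highlighter-rouge">best_oa_location</code>, and the sample records are illustrative assumptions.</p>

```python
from typing import Optional


def fetchable_oa_url(record: dict) -> Optional[str]:
    """Return the OA URL to fetch, or None when the record signals no OA copy.

    Only fields named in this post are read: is_oa, oa_status, best_oa_location.
    The "url" key inside best_oa_location is an assumption for illustration.
    """
    if not record.get("is_oa") or record.get("oa_status") == "closed":
        # Closed or missing signal: stop here, never hunt for other copies.
        return None
    location = record.get("best_oa_location") or {}
    # May still be None; callers treat None as "do not fetch".
    return location.get("url")


# Hypothetical records, hand-written for this example.
open_record = {
    "is_oa": True,
    "oa_status": "gold",
    "best_oa_location": {"url": "https://repository.example.org/article.pdf"},
}
closed_record = {"is_oa": False, "oa_status": "closed", "best_oa_location": None}

print(fetchable_oa_url(open_record))
print(fetchable_oa_url(closed_record))
```

<p>The design choice is that the guard has no fallback path: a missing or closed signal ends the workflow for that DOI instead of triggering a wider search.</p>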

<p>Sci-Hub and similar infringing services are not OA discovery—they are unauthorized distribution. Any AI pipeline that routes through them fails compliance regardless of how the DOI was resolved.</p>

<h2 id="scraping-versus-reading">Scraping Versus Reading</h2>

<p>Even when Ring 2 access is free for human readers, Ring 3 and site policies may restrict automation.</p>

<p><strong>Risk analysis:</strong> Headless browsers fetching thousands of OA PDFs overnight can trigger bot management, breach-of-contract claims under terms of service, and—in the U.S.—debates about computer fraud and abuse statutes that courts have applied inconsistently (<em>jurisdiction-specific; not legal advice</em>). The Congressional Research Service provides a neutral overview of generative AI and copyright intersections (<a href="https://www.congress.gov/crs-product/LSB10922">CRS LSB10922</a>), but site-access questions often turn on contract and computer-access law rather than copyright alone.</p>

<p>The operative distinction for governance:</p>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>Typical risk profile</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>User-initiated single fetch for session summarization</td>
      <td>Lower</td>
    </tr>
    <tr>
      <td>Unattended corpus harvest into persistent storage</td>
      <td>Higher</td>
    </tr>
    <tr>
      <td>Commercial RAG index built from publisher domains at scale</td>
      <td>Highest without explicit license</td>
    </tr>
  </tbody>
</table>

<p><strong>Fact:</strong> Unpaywall’s API terms cap daily call volume; the documented limit is 100,000 calls per day (<a href="https://unpaywall.org/products/api">REST API</a>). That is a contractual boundary, distinct from publisher scraping norms but related to them.</p>

<h2 id="browser-extension-versus-ai-agent">Browser Extension Versus AI Agent</h2>

<p>The Unpaywall extension operates in a human context: one page, one article, one click. An AI agent operates in a machine context: loops, retries, parallel fetches, unpredictable volume, and downstream storage.</p>

<p>Compliance controls for agents should include:</p>

<ul>
  <li><strong>Source allowlists</strong> (repositories, known OA hosts—not open web wildcards)</li>
  <li><strong>Rate limits</strong> per domain and per workflow</li>
  <li><strong>Robots.txt and ToS review</strong> before automated fetch is enabled</li>
  <li><strong>No credential injection</strong> into headless sessions</li>
  <li><strong>Separation of session read vs persistent retain</strong> (display for answer synthesis vs embed in vector DB)</li>
</ul>
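
<p>The first three controls in the list above can be sketched with the Python standard library. This is an illustration under stated assumptions, not a complete agent gate: the allowlisted hosts, the pacing interval, and the user-agent string are hypothetical, and the robots.txt text is assumed to have been retrieved separately. Terms-of-service review, credential handling, and the read-versus-retain split are not modeled.</p>

```python
import time
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration only.
ALLOWED_HOSTS = {"repository.example.org", "preprints.example.net"}
MIN_SECONDS_BETWEEN_FETCHES = 5.0
USER_AGENT = "example-research-agent"

_last_fetch = {}  # host -> time of the last permitted fetch


def robots_allows(robots_txt: str, url: str) -> bool:
    """Check a URL against robots.txt text that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def may_fetch(url: str, robots_txt: str, now: Optional[float] = None) -> bool:
    """Apply allowlist, robots.txt, and per-domain pacing before any fetch."""
    host = urlsplit(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return False  # not on the source allowlist
    if not robots_allows(robots_txt, url):
        return False  # robots.txt disallows this path for our agent
    now = time.monotonic() if now is None else now
    last = _last_fetch.get(host)
    if last is not None and now - last < MIN_SECONDS_BETWEEN_FETCHES:
        return False  # too soon after the previous fetch from this host
    _last_fetch[host] = now
    return True
```

<p>A caller would invoke <code class="language-plaintext highlighter-rouge">may_fetch</code> immediately before each HTTP request and skip the request when it returns false. The optional <code class="language-plaintext highlighter-rouge">now</code> argument exists so the pacing logic can be tested with fixed clock values.</p>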

<p>Anthropic’s <a href="https://www.anthropic.com/news/usage-policy-update">Usage Policy updates</a> clarify platform-side prohibitions on malicious computer and network compromise. Your retrieval agent’s behavior must comply with both vendor policy and publisher policy—the stricter effective control wins.</p>

<h2 id="edge-cases-that-break-naive-automation">Edge Cases That Break Naive Automation</h2>

<p><strong>Bronze OA.</strong> Free on the publisher site but <code class="language-plaintext highlighter-rouge">license</code> null in Unpaywall metadata (<a href="https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-">oa_status definitions</a>). Access may be lawful; reuse for AI products is ambiguous. Default-deny for automated commercial reuse unless counsel approves.</p>

<p><strong>Version mismatch.</strong> Green repository copies may be preprints or author-accepted manuscripts, not the version of record. Citation norms, commercial use, and publisher policies may differ (<a href="https://www.openaccess.nl/en/publishing/copyright-and-open-licenses">openaccess.nl copyright guide</a>).</p>

<p><strong>Paratext and non-article DOIs.</strong> Unpaywall indexes Crossref DOIs broadly; not every DOI is a journal article (<a href="https://support.unpaywall.org/support/solutions/folders/44000384007">support FAQ</a>). Garbage DOIs in, garbage compliance out.</p>

<p><strong>Extension-only bronze detection.</strong> The extension may surface bronze when it finds a PDF on-page without going through the API—behavior documented in Unpaywall support materials. Agents relying only on API metadata may classify the same article differently.</p>

<h3 id="real-world-example">Real-World Example</h3>

<p>A compliance team audits a “research agent” that queries Unpaywall, then fetches ten thousand PDFs overnight from publisher domains into cloud storage for RAG indexing. <strong>Ring 1</strong> (API use) may be lawful if rate limits were respected. <strong>Ring 2</strong> (many copies are OA) may be lawful for access. <strong>Ring 3</strong> (licenses vary; bronze and NC articles mixed in) is unreviewed. <strong>Ring 4</strong> (site ToS, scraping norms) is likely violated regardless of OA status. Discovery was not the failure point; automation scale and reuse without license triage were.</p>
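
<p>One way to make an audit like this repeatable is to record a finding per ring and report every ring that is not clean. The sketch below encodes the findings from the example as data; the <code class="language-plaintext highlighter-rouge">Finding</code> labels and the function name are illustrative choices, not an established scheme.</p>

```python
from enum import Enum


class Finding(Enum):
    OK = "ok"
    UNREVIEWED = "unreviewed"
    VIOLATED = "likely violated"


# Ring-by-ring findings from the audit narrative, encoded as data.
audit = {
    1: Finding.OK,          # API use, assuming rate limits were respected
    2: Finding.OK,          # access to OA copies
    3: Finding.UNREVIEWED,  # mixed licenses, bronze and NC included
    4: Finding.VIOLATED,    # site ToS and scraping norms
}


def failing_rings(findings: dict) -> list:
    """Return ring numbers, in order, whose finding is anything other than OK."""
    return [ring for ring in sorted(findings) if findings[ring] is not Finding.OK]


print(failing_rings(audit))  # [3, 4]
```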

<h2 id="technical-compliance-policy-engine-design">Technical Compliance: Policy Engine Design</h2>

<p>Build—or buy—a policy engine that evaluates each candidate document before fetch and again before retain:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Inputs:  doi, oa_status, license, host_type, robots_txt, intended_action
Actions: display | summarize_ephemeral | store | embed | train
Decision: ALLOW | REVIEW | DENY
</code></pre></div></div>

<p>Recommended defaults:</p>

<ul>
  <li><strong>DENY</strong> automated retain/embed when <code class="language-plaintext highlighter-rouge">license</code> is null</li>
  <li><strong>REVIEW</strong> all green repository copies for version and embargo</li>
  <li><strong>DENY</strong> fetch from toll-access URLs unless separate authorization exists</li>
  <li><strong>ALLOW</strong> ephemeral summarize of CC BY with attribution logging</li>
</ul>
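
<p>The recommended defaults can be sketched as a small decision function. This is one possible reading, not a reference implementation: the class and field names follow the inputs listed above, the license string <code class="language-plaintext highlighter-rouge">cc-by</code> is an assumed spelling, <code class="language-plaintext highlighter-rouge">train</code> is treated as a retain action as an extra assumption, and the <code class="language-plaintext highlighter-rouge">robots_txt</code> input and attribution logging are left out.</p>

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "ALLOW"
    REVIEW = "REVIEW"
    DENY = "DENY"


@dataclass
class Candidate:
    doi: str
    oa_status: str           # gold | green | hybrid | bronze | closed
    license: Optional[str]   # assumed spelling such as "cc-by"; None when missing
    host_type: str           # "publisher" or "repository"
    intended_action: str     # display | summarize_ephemeral | store | embed | train


# Assumption: "train" is treated like store/embed for the null-license rule.
RETAIN_ACTIONS = {"store", "embed", "train"}


def decide(c: Candidate) -> Decision:
    # Toll-access content: deny (separate authorization is out of scope here).
    if c.oa_status == "closed":
        return Decision.DENY
    # Null license: deny automated retain/embed.
    if c.license is None and c.intended_action in RETAIN_ACTIONS:
        return Decision.DENY
    # Green repository copies: review for version and embargo.
    if c.oa_status == "green":
        return Decision.REVIEW
    # CC BY ephemeral summarization: allow (attribution logging not shown).
    if c.license == "cc-by" and c.intended_action == "summarize_ephemeral":
        return Decision.ALLOW
    # Anything the defaults do not cover goes to a human reviewer.
    return Decision.REVIEW
```

<p>Rule order matters in this sketch: denials are checked before reviews, and the final fallback is <code class="language-plaintext highlighter-rouge">REVIEW</code> rather than <code class="language-plaintext highlighter-rouge">ALLOW</code>, so a case nobody anticipated is never approved silently.</p>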

<h2 id="risks-and-counterarguments">Risks and Counterarguments</h2>

<p><strong>“OA means we can scrape freely.”</strong> Bronze and green complicate this. Free to read ≠ free to bulk-ingest for commercial AI.</p>

<p><strong>“We’re just like the browser extension.”</strong> Scale, automation, storage, and product embedding change the risk profile. Equivalence is an engineering claim, not a legal defense.</p>

<p><strong>“The API gave us a URL, so we’re covered.”</strong> The API gives a location, not a use license for your product.</p>

<h2 id="practical-checklist">Practical Checklist</h2>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Document prohibited sources (toll-access, credential-gated, known infringing mirrors)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement robots.txt and ToS review for automated fetch</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Separate “read for user session” from “retain for RAG/training”</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Default-deny bronze and unlicensed copies for automated reuse</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Legal review for cross-border teams (EU CDSM text-and-data-mining exceptions; UK TDM exception limited to non-commercial research)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Audit agent volume and domain spread quarterly</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>The compliance boundary is behavioral and contextual, not merely technical. A clean Unpaywall API lookup is necessary but never sufficient. On the other side of the access question, creators and publishers must also ask whether open access and AI economics actually pay those who produce the work, a tension this boundary analysis does not resolve on its own.</p>

<hr />

<p><strong>Need a boundary assessment for your AI retrieval stack?</strong> <a href="https://goldmanmalka.com/about">Let’s map discovery, access, and reuse rings to your controls</a>.</p>

<hr />

<h3 id="relevant-sources">Relevant Sources</h3>

<ol>
  <li><strong>Browser Extension | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/products/extension">https://unpaywall.org/products/extension</a></li>
  <li><strong>Link Resolver Integrations</strong> — Unpaywall Support — <a href="https://support.unpaywall.org/support/solutions/articles/44001874811-link-resolver-integrations">https://support.unpaywall.org/support/solutions/articles/44001874811-link-resolver-integrations</a></li>
  <li><strong>Integrations | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/integrations">https://unpaywall.org/integrations</a></li>
  <li><strong>Copyright and Open Licenses</strong> — openaccess.nl — <a href="https://www.openaccess.nl/en/publishing/copyright-and-open-licenses">https://www.openaccess.nl/en/publishing/copyright-and-open-licenses</a></li>
  <li><strong>Usage Policy Update</strong> — Anthropic — <a href="https://www.anthropic.com/news/usage-policy-update">https://www.anthropic.com/news/usage-policy-update</a></li>
  <li><strong>Generative Artificial Intelligence and Copyright Law</strong> — U.S. Congress CRS — <a href="https://www.congress.gov/crs-product/LSB10922">https://www.congress.gov/crs-product/LSB10922</a> (jurisdiction-specific)</li>
  <li><strong>REST API | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/products/api">https://unpaywall.org/products/api</a></li>
  <li><strong>Unpaywall Support FAQ</strong> — Unpaywall — <a href="https://support.unpaywall.org/support/solutions/folders/44000384007">https://support.unpaywall.org/support/solutions/folders/44000384007</a></li>
</ol>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Compliance" /><category term="open access" /><category term="Unpaywall" /><category term="copyright" /><category term="AI governance" /><category term="scholarly publishing" /><category term="terms of service" /><category term="scraping" /><category term="AI compliance" /><summary type="html"><![CDATA[The compliance line in AI-assisted scholarly retrieval is not drawn at “open versus closed.” It runs between locating publisher-authorized open-access copies and any practice that bypasses technical or contractual access controls, exceeds license scope, or violates site or API terms—even when an AI system could technically retrieve the bytes. Here we map where lawful OA ends and circumvention, scraping, and policy violations begin.]]></summary></entry><entry><title type="html">Can Claude Use Unpaywall Legally? Access, Discovery, and the First Compliance Question</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/claude-unpaywall-legal/" rel="alternate" type="text/html" title="Can Claude Use Unpaywall Legally? Access, Discovery, and the First Compliance Question" /><published>2026-06-02T07:00:00+00:00</published><updated>2026-06-02T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/claude-unpaywall-legal</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/claude-unpaywall-legal/"><![CDATA[<p>When teams wire Claude or another AI assistant into scholarly research workflows, the first technical question is usually practical: <em>can we use Unpaywall to find full-text papers?</em> The first compliance question is harder: <em>does discovery through an open-access index grant permission to copy, store, summarize, or commercialize what we retrieve?</em> <strong>Unpaywall can lawfully help locate publisher-authorized or repository-hosted open-access copies of scholarly articles, but using that discovery in an AI retrieval workflow still leaves separate, and often stricter, questions about copying, terms of service, licensing, and downstream reuse.</strong></p>

<!--more-->

<h2 id="two-sides-of-the-same-pipeline">Two Sides of the Same Pipeline</h2>

<p>Every scholarly retrieval workflow has two stakeholders whose incentives do not automatically align.</p>

<p><strong>Side A — the AI user or builder:</strong> Can a system like Claude lawfully use Unpaywall-style open-access discovery to reach content?</p>

<p><strong>Side B — the creator, publisher, or rights holder:</strong> Even when access is legal, are you sure you get paid, stay compliant, and remain protected when AI systems ingest or repackage your work?</p>

<p>Both sides deserve equal seriousness. This article does not frame the topic as paywall bypassing. Unpaywall is an open-access <strong>discovery</strong> service, not a piracy tool. Throughout, we distinguish <strong>facts</strong> (what a system does), <strong>interpretation</strong> (how law or policy may apply), and <strong>risk analysis</strong> (what can go wrong operationally).</p>

<p>The focus here is mechanics and first-order questions around <em>discovery</em> and initial retrieval—not model training, bulk redistribution, or commercial exploitation.</p>

<h2 id="what-unpaywall-actually-is-and-is-not">What Unpaywall Actually Is (and Is Not)</h2>

<p><strong>Fact:</strong> <a href="https://unpaywall.org/">Unpaywall</a> is an open database operated by the nonprofit OurResearch. It indexes open-access locations for scholarly articles identified by Crossref Digital Object Identifiers (DOIs), drawing on data from tens of thousands of publishers and repositories.</p>

<p><strong>Fact:</strong> When the browser extension detects a DOI on a page, it queries the Unpaywall API to retrieve the best known open-access location for that article (<a href="https://unpaywall.org/faq">FAQ</a>). The API returns structured JSON including <code class="language-plaintext highlighter-rouge">is_oa</code>, <code class="language-plaintext highlighter-rouge">oa_status</code>, and a <code class="language-plaintext highlighter-rouge">best_oa_location</code> object with fields such as <code class="language-plaintext highlighter-rouge">license</code>, <code class="language-plaintext highlighter-rouge">host_type</code>, and URLs (<a href="https://unpaywall.org/data-format">data format</a>).</p>

<p><strong>Fact:</strong> Unpaywall does not crack paywalls. It points to copies that publishers or repositories have made available. If no open-access location exists, integrations may redirect to the toll-access publisher page—behavior that link-resolver documentation describes as working correctly (<a href="https://support.unpaywall.org/support/solutions/articles/44001874811-link-resolver-integrations">link resolver integrations</a>).</p>

<p><strong>Interpretation:</strong> Describing Unpaywall as a “piracy tool” mischaracterizes its stated function. Libraries, discovery systems, and link resolvers worldwide integrate it as legitimate scholarly infrastructure (<a href="https://unpaywall.org/integrations">integrations</a>). That said, <strong>misuse of URLs returned by Unpaywall</strong>—for example, unattended bulk scraping that violates publisher terms—is a separate compliance problem from discovery itself.</p>

<h2 id="open-access-status-is-not-a-single-permission">Open-Access Status Is Not a Single Permission</h2>

<p>Unpaywall assigns each article an <code class="language-plaintext highlighter-rouge">oa_status</code>: <code class="language-plaintext highlighter-rouge">gold</code>, <code class="language-plaintext highlighter-rouge">green</code>, <code class="language-plaintext highlighter-rouge">hybrid</code>, <code class="language-plaintext highlighter-rouge">bronze</code>, or <code class="language-plaintext highlighter-rouge">closed</code> (<a href="https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-">oa_status definitions</a>). These labels describe <em>where</em> and <em>how</em> openness was achieved—not a blanket grant of rights for every downstream use.</p>

<table>
  <thead>
    <tr>
      <th>Status</th>
      <th>Plain-language meaning</th>
      <th>Compliance note</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gold</td>
      <td>Published in a fully OA journal</td>
      <td>Often carries explicit license (e.g., CC BY)</td>
    </tr>
    <tr>
      <td>Hybrid</td>
      <td>OA article in otherwise toll-access journal</td>
      <td>License usually present on OA copy</td>
    </tr>
    <tr>
      <td>Green</td>
      <td>Archived in repository</td>
      <td>Version and license may differ from publisher VoR</td>
    </tr>
    <tr>
      <td>Bronze</td>
      <td>Free to read on publisher site</td>
      <td><code class="language-plaintext highlighter-rouge">license</code> field may be null—reuse rights unclear</td>
    </tr>
    <tr>
      <td>Closed</td>
      <td>No OA location indexed</td>
      <td>Discovery may still return publisher toll page</td>
    </tr>
  </tbody>
</table>

<p><strong>Risk:</strong> Treating <code class="language-plaintext highlighter-rouge">is_oa: true</code> as permission for AI ingestion, summarization-for-resale, embedding in a commercial RAG product, or model training without reading the license is one of the most common compliance failures we see in design reviews.</p>

<p>SPARC defines open access as free, immediate, online availability <strong>plus the rights to use articles fully in the digital environment</strong> (<a href="https://sparcopen.org/open-access/">SPARC Open Access</a>). That definition makes clear that OA is not merely “free to read”—but the <em>extent</em> of reuse rights still depends on the specific license attached to each copy.</p>

<h2 id="how-ai-retrieval-differs-from-clicking-the-green-tab">How AI Retrieval Differs from Clicking the Green Tab</h2>

<p>The Unpaywall browser extension is a human-in-the-loop pattern: a researcher on a publisher page clicks a tab when a legal OA copy exists (<a href="https://unpaywall.org/products/extension">extension</a>). An AI agent workflow is structurally different:</p>

<ol>
  <li>User query or task trigger</li>
  <li>DOI resolution (from citation, Crossref, or metadata)</li>
  <li>Unpaywall API call (<code class="language-plaintext highlighter-rouge">GET /v2/{doi}?email=...</code>)</li>
  <li>HTTP fetch of the returned OA URL</li>
  <li>PDF/HTML parse, chunk, embed, or summarize</li>
  <li>Optional storage in vector DB, cache, or logs</li>
  <li>Output to user or downstream product</li>
</ol>

<p>Each step can invoke different obligations: API terms, copyright reproduction, database rights (jurisdiction-specific), site terms of service, and license conditions on reuse.</p>
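
<p>Step 3 and the metadata hand-off to later steps can be sketched as follows. The path <code class="language-plaintext highlighter-rouge">/v2/{doi}?email=...</code> comes from the list above; the host <code class="language-plaintext highlighter-rouge">api.unpaywall.org</code> is taken from Unpaywall’s public API documentation rather than from this post, and the function names, the <code class="language-plaintext highlighter-rouge">url</code> key, the DOI, and the sample payload are hypothetical. No network request is made.</p>

```python
import json
from urllib.parse import quote, urlencode

API_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for step 3; the email parameter is required."""
    return API_BASE + quote(doi, safe="/") + "?" + urlencode({"email": email})


def summarize_record(payload: str) -> dict:
    """Pull out the fields this post relies on from a JSON response body."""
    record = json.loads(payload)
    location = record.get("best_oa_location") or {}
    return {
        "is_oa": record.get("is_oa"),
        "oa_status": record.get("oa_status"),
        "license": location.get("license"),
        "host_type": location.get("host_type"),
        "url": location.get("url"),
    }


# Hand-written sample payload for illustration, not a real API response.
sample = json.dumps({
    "doi": "10.1234/example.5678",
    "is_oa": True,
    "oa_status": "bronze",
    "best_oa_location": {
        "license": None,
        "host_type": "publisher",
        "url": "https://publisher.example.com/article/5678",
    },
})

print(unpaywall_url("10.1234/example.5678", "ops@example.org"))
print(summarize_record(sample))
```

<p>Keeping <code class="language-plaintext highlighter-rouge">license</code> and <code class="language-plaintext highlighter-rouge">host_type</code> alongside the URL means the later fetch, storage, and reuse steps can each check their own obligations instead of relying on <code class="language-plaintext highlighter-rouge">is_oa</code> alone.</p>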

<p><strong>Provenance requirement:</strong> At fetch time, log the DOI, <code class="language-plaintext highlighter-rouge">oa_status</code>, license string, <code class="language-plaintext highlighter-rouge">host_type</code>, URL fetched, timestamp, and document version where identifiable. Without this record, later audit—regulatory, contractual, or internal—is guesswork.</p>
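
<p>A provenance entry with the fields listed above might look like the sketch below. The class name, field names, and the one-JSON-object-per-line log format are assumptions chosen for illustration; any append-only store that preserves these values would serve the same purpose.</p>

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    doi: str
    oa_status: str
    license: Optional[str]   # None when the metadata carries no license
    host_type: str
    url_fetched: str
    fetched_at: str          # ISO 8601 timestamp in UTC
    version: Optional[str]   # document version label, when identifiable


def make_record(doi, oa_status, license, host_type, url_fetched,
                version=None, now=None):
    """Build one audit record at fetch time."""
    now = now or datetime.now(timezone.utc)
    return ProvenanceRecord(doi, oa_status, license, host_type, url_fetched,
                            now.isoformat(), version)


def to_log_line(record: ProvenanceRecord) -> str:
    """Serialize as one JSON object per line, suitable for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)
```

<p>The record is frozen so that an entry cannot be altered after the fetch it describes; a correction would be a new record rather than an edit.</p>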

<h2 id="three-compliance-layers-discovery-access-reuse">Three Compliance Layers: Discovery, Access, Reuse</h2>

<p>Think of compliance as three stacked layers, not one switch:</p>

<p><strong>Layer 1 — Discovery compliance:</strong> Are you using the Unpaywall API as intended? Requests require a valid email parameter; the documented limit is 100,000 calls per day (<a href="https://unpaywall.org/products/api">REST API</a>). High-volume or commercial deployments should contact OurResearch about data feeds rather than assuming anonymous bulk access is appropriate. Agent loops that hammer the API invite HTTP 429 responses and reputational harm.</p>

<p><strong>Layer 2 — Access compliance:</strong> Is the copy at the returned URL lawfully available for your fetch? For hybrid or gold OA with a clear license, access is typically uncontroversial. For bronze OA or green repository copies, access may be lawful while reuse remains ambiguous.</p>

<p><strong>Layer 3 — Reuse compliance:</strong> Does your intended use—summarize, store, adapt, train, redistribute—fall within the license, publisher policy, and applicable law? <strong>Access does not imply reuse rights.</strong> A CC BY article permits broad reuse with attribution (<a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>); a bronze article with no license metadata does not give you a safe default.</p>
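
<p>For Layer 1, an agent can keep a client-side count so that a runaway loop stops before it reaches the documented 100,000-calls-per-day limit. The sketch below is a simple in-memory counter; the class name is hypothetical, and resetting on the local calendar day is an assumption, since this post does not say when the API’s own daily window rolls over.</p>

```python
from datetime import date
from typing import Optional

DAILY_LIMIT = 100_000  # documented Unpaywall API limit cited in Layer 1


class DailyBudget:
    """Count API calls per calendar day and refuse calls past the limit."""

    def __init__(self, limit: int = DAILY_LIMIT):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_spend(self, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if today != self.day:
            # New day (assumed local calendar day): reset the counter.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False  # caller should stop, not retry in a loop
        self.used += 1
        return True
```

<p>An agent would call <code class="language-plaintext highlighter-rouge">try_spend()</code> before each lookup and halt the workflow on a false result, which is gentler on the service than discovering the limit through HTTP 429 responses.</p>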

<h2 id="where-claude-and-anthropic-fit">Where Claude and Anthropic Fit</h2>

<p>Anthropic’s <a href="https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance">legal and compliance documentation</a> makes an important distinction: your relationship with the AI vendor is governed by <strong>Commercial Terms</strong> (API, Team, Enterprise) or <strong>Consumer Terms</strong> (Free, Pro, Max). These terms regulate data retention, training on your inputs, and acceptable use—they do <strong>not</strong> grant copyright permissions for third-party scholarly works.</p>

<p><strong>Fact:</strong> Under Commercial Terms, Anthropic does not use customer API content for model training by default. Consumer tiers have different retention and opt-in training settings (<a href="https://www.anthropic.com/news/updates-to-our-consumer-terms">consumer terms update</a>).</p>

<p><strong>Interpretation:</strong> Choosing the right tier is a compliance control for <em>your organization’s data</em>, not a substitute for publisher licenses. The <a href="https://www.anthropic.com/news/usage-policy-update">Usage Policy</a> also restricts certain high-risk uses and malicious automation; it does not replace Creative Commons or publisher terms.</p>

<h2 id="can-claude-use-unpaywall-legally-a-nuanced-answer">Can Claude Use Unpaywall Legally? A Nuanced Answer</h2>

<p>There is no universal yes/no answer—outcomes depend on use case, license, jurisdiction, and scale.</p>

<p><strong>Likely lower-risk pattern (interpretation, not legal conclusion):</strong> A human-supervised research assistant on Commercial/API terms queries Unpaywall for a known DOI, fetches a hybrid-OA article licensed CC BY, summarizes it with proper attribution for an internal literature review, and does not retain full text beyond the session—provided publisher and repository terms of service do not prohibit the automated fetch.</p>

<p><strong>Higher-risk patterns:</strong></p>

<ul>
  <li>Unattended agents bulk-downloading bronze OA (<code class="language-plaintext highlighter-rouge">license: null</code>) for commercial RAG</li>
  <li>Combining Unpaywall metadata with institutional proxy credentials to fetch toll-access versions</li>
  <li>Feeding full text into consumer-tier Claude where retention and training settings may not meet enterprise confidentiality requirements</li>
  <li>Inferring training rights from <code class="language-plaintext highlighter-rouge">is_oa</code> alone</li>
</ul>

<p><strong>Uncertain and jurisdiction-specific:</strong> Whether transient copies made solely for summarization qualify as fair use or fair dealing; how EU CDSM text-and-data-mining exceptions apply; the UK’s narrow TDM exception limited to non-commercial research. The <a href="https://copyright.gov/ai/">U.S. Copyright Office AI initiative</a> is actively examining training and reuse questions but does not resolve retrieval-for-summarization in every scenario.</p>

<h3 id="real-world-example">Real-World Example</h3>

<p>A biotech startup wires a Claude agent to Unpaywall for competitive intelligence. The agent fetches a hybrid-OA oncology paper tagged CC BY in Unpaywall’s API response. <strong>Discovery</strong> is consistent with Unpaywall’s design. <strong>Access</strong> to the publisher-hosted OA PDF is likely lawful. <strong>Reuse</strong> becomes the crux: embedding chunks in a commercial product requires CC BY attribution compliance, human oversight for medical claims, and a decision about whether persistent vector storage counts as “adaptation” under the license. None of that is settled by the green tab alone.</p>

<h2 id="risks-and-counterarguments">Risks and Counterarguments</h2>

<p><strong>“Unpaywall says it’s legal, so our pipeline is legal.”</strong> Unpaywall locates OA copies authorized by publishers and repositories. It does not warrant your use case, jurisdiction, volume, or downstream product.</p>

<p><strong>“If it’s on the web, Claude can read it.”</strong> Visibility is not permission. Robots.txt, terms of service, and license terms may restrict automated access even when a human could read the same page for free.</p>

<p><strong>Conflating extension UX with headless scraping.</strong> The extension operates at human scale with single-article intent. Agents operating at corpus scale change the risk profile even when the underlying OA copies are the same.</p>

<h2 id="practical-checklist-side-a-builders">Practical Checklist: Side A (Builders)</h2>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Confirm API use within rate limits with a valid contact email</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Capture <code class="language-plaintext highlighter-rouge">oa_status</code>, <code class="language-plaintext highlighter-rouge">license</code>, and <code class="language-plaintext highlighter-rouge">host_type</code> for every retrieved document</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Map intended use (read, summarize, store, train, redistribute) against license text</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Use Commercial/API terms for enterprise retrieval agents</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Default-deny pipeline progression when <code class="language-plaintext highlighter-rouge">license</code> is null (bronze) unless legal review approves</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Document human oversight for summaries used in regulated or high-stakes domains</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Never infer training or commercial reuse rights from <code class="language-plaintext highlighter-rouge">is_oa</code> alone</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>Legal open-access discovery is a legitimate scholarly infrastructure pattern—and for many research workflows, Unpaywall is an appropriate starting point. It is the <strong>start</strong> of compliance, not the end. The boundary between lawful OA access and circumvention, scraping, and terms-of-service violations is a separate layer—governed by scale, automation, and license scope as much as by whether a copy was discoverable in the first place.</p>

<hr />

<p><strong>Designing agentic research workflows?</strong> If you need a compliance architecture review—discovery, provenance, data retention, and license-aware retrieval—<a href="https://goldmanmalka.com/about">contact me directly</a> to discuss a focused engagement.</p>

<hr />

<h3 id="relevant-sources">Relevant Sources</h3>

<ol>
  <li><strong>FAQ | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/faq">https://unpaywall.org/faq</a> — Defines extension/API discovery model and DOI-based lookup.</li>
  <li><strong>REST API | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/products/api">https://unpaywall.org/products/api</a> — API requirements, email parameter, rate limits.</li>
  <li><strong>Data Format | Unpaywall</strong> — OurResearch — <a href="https://unpaywall.org/data-format">https://unpaywall.org/data-format</a> — Schema for <code class="language-plaintext highlighter-rouge">license</code>, <code class="language-plaintext highlighter-rouge">host_type</code>, OA locations.</li>
  <li><strong>OA status definitions</strong> — Unpaywall Support — <a href="https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-">https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-</a> — Permission granularity by status type.</li>
  <li><strong>Legal and compliance (Claude Code)</strong> — Anthropic — <a href="https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance">https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance</a> — Commercial vs consumer terms, usage policy linkage.</li>
  <li><strong>Open Access</strong> — SPARC — <a href="https://sparcopen.org/open-access/">https://sparcopen.org/open-access/</a> — Authoritative OA definition including reuse rights.</li>
  <li><strong>Deed: CC BY 4.0</strong> — Creative Commons — <a href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a> — Standard license conditions for many OA articles.</li>
  <li><strong>Copyright and Artificial Intelligence</strong> — U.S. Copyright Office — <a href="https://copyright.gov/ai/">https://copyright.gov/ai/</a> — U.S. federal framing for AI/copyright intersection (jurisdiction-specific).</li>
</ol>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Compliance" /><category term="open access" /><category term="Unpaywall" /><category term="Claude" /><category term="copyright" /><category term="AI governance" /><category term="scholarly publishing" /><category term="provenance" /><category term="AI compliance" /><summary type="html"><![CDATA[When teams wire Claude or another AI assistant into scholarly research workflows, the first technical question is usually practical: can we use Unpaywall to find full-text papers? The first compliance question is harder: does discovery through an open-access index grant permission to copy, store, summarize, or commercialize what we retrieve? Unpaywall can lawfully help locate publisher-authorized or repository-hosted open-access copies of scholarly articles, but using that discovery in an AI retrieval workflow still leaves separate, and often stricter, questions about copying, terms of service, licensing, and downstream reuse.]]></summary></entry><entry><title type="html">From Tokenmaxxing to Economic Governance: The 2026 AI Roadmap for CTOs Who Want a 2027 Budget</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/2026-ai-roadmap/" rel="alternate" type="text/html" title="From Tokenmaxxing to Economic Governance: The 2026 AI Roadmap for CTOs Who Want a 2027 Budget" /><published>2026-05-28T07:00:00+00:00</published><updated>2026-05-28T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/2026-ai-roadmap</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/2026-ai-roadmap/"><![CDATA[<p>Over the past seven posts, this series has examined the 2026 AI token economy from seven distinct angles: the Uber budget collapse, the physics of context scaling, recursive agent loops, per-minute pricing disruption, KV cache optimisation, infrastructure power constraints, and the local inference break-even. 
Each post was a close-up on a specific failure mode or opportunity. This final post is the wide-angle view — a synthesis of the full series into a governance framework that translates individual insights into organisational practice. The thesis is simple: <strong>the era of “tokenmaxxing” — deploying AI at maximum capability without cost discipline — is over.</strong> The organisations that will thrive in 2027 are those that implement economic governance over their AI stacks before their next budget cycle, not after.</p>

<!--more-->

<h2 id="what-tokenmaxxing-means-and-why-it-ends">What “Tokenmaxxing” Means and Why It Ends</h2>

<p><strong>Tokenmaxxing</strong> is the implicit strategy of the 2023–2025 AI adoption wave: use the most capable model available, maximise context windows, deploy agents liberally, and treat AI costs as a growth investment that will be justified by productivity gains. The rationale was defensible in the early adoption phase — when AI capabilities were advancing rapidly, the cost of under-investing in capability was higher than the cost of over-spending on tokens.</p>

<p>That calculus has inverted. Frontier model capabilities have plateaued relative to the cost curve. The gap between a Haiku-class model and a Sonnet-class model on most production tasks is far smaller than the 10x pricing gap between them. The gap between a 4-bit quantised Mistral 7B and a frontier API model on classification and extraction is operationally negligible for the majority of enterprise workloads. Meanwhile, the budget consequences of unconstrained API spend have compounded — as Uber discovered — to the point where AI tooling is consuming budget at a rate that is not sustainable without explicit governance.</p>

<p>Economic governance does not mean restricting AI usage. It means ensuring that every dollar of AI spend is allocated to the workload tier that requires it, that usage is instrumented and attributed, and that the organisation has the data to make rational decisions about where to invest and where to optimise.</p>

<h2 id="the-seven-layer-governance-framework">The Seven-Layer Governance Framework</h2>

<p>The synthesis of this series yields a framework organised across seven domains, each corresponding to a specific cost vector addressed in the preceding posts.</p>

<p><strong>Layer 1: Token Metering and Attribution</strong>
Every API call must be attributed to a cost centre, a team, a product, and a workload type. This is the prerequisite for all other governance. Without attribution, optimisation is guesswork. Instrument at the API gateway layer — not the application layer — so that attribution is infrastructure-enforced rather than developer-maintained.</p>
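<p>As a rough illustration of what gateway-level attribution could record, the sketch below tags each call with the four dimensions named above and rolls spend up per team. The <code class="language-plaintext highlighter-rouge">UsageRecord</code> fields, the sample records, and the prices passed in are hypothetical placeholders, not any provider’s schema.</p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered API call, tagged at the gateway (field names are hypothetical)."""
    cost_centre: str
    team: str
    product: str
    workload_type: str
    input_tokens: int
    output_tokens: int


def spend_by_team(records, input_price_per_m, output_price_per_m):
    """Aggregate dollar spend per team; prices are per million tokens."""
    totals = defaultdict(float)
    for r in records:
        cost = (r.input_tokens * input_price_per_m
                + r.output_tokens * output_price_per_m) / 1_000_000
        totals[r.team] += cost
    return dict(totals)


# Illustrative records only; real ones would be emitted by the gateway.
records = [
    UsageRecord("cc-101", "search", "assistant", "extraction", 800, 200),
    UsageRecord("cc-101", "search", "assistant", "extraction", 1200, 300),
    UsageRecord("cc-202", "billing", "invoices", "classification", 500, 50),
]
print(spend_by_team(records, input_price_per_m=0.25, output_price_per_m=1.25))
```

<p>The same records can be grouped by cost centre, product, or workload type by swapping the grouping key, which is the practical benefit of capturing all four tags on every call.</p>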

<p><strong>Layer 2: Context Window Discipline</strong>
The context tax is real and compounding. Establish organisational standards for what belongs in a context window: what is a required system prompt, what is optional, what is never appropriate. Measure average input token counts per workload. Any workload averaging more than 10,000 input tokens per call without a clear quality justification is a candidate for context pruning.</p>
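<p>The 10,000-token rule of thumb above can be checked mechanically once input token counts are metered. This is a minimal sketch; the workload names and sample counts are made up for illustration.</p>

```python
from statistics import mean


def pruning_candidates(input_tokens_by_workload, limit=10_000):
    """Return workloads whose average input tokens per call exceed the limit."""
    return sorted(
        name
        for name, counts in input_tokens_by_workload.items()
        if counts and mean(counts) > limit
    )


# Hypothetical per-call input token samples for two workloads.
samples = {
    "support-triage": [1_200, 900, 1_500],
    "contract-review": [18_000, 22_500, 15_000],
}
print(pruning_candidates(samples))  # ['contract-review']
```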

<p><strong>Layer 3: Agent Step Budgets</strong>
Every agentic workflow must have an explicit step budget — a maximum number of tool calls or LLM invocations per task execution. No agent should run unbounded. The budget should be set empirically: measure the step distribution for a sample of successful task executions, set the cap at the 95th percentile, and implement circuit-breaker logic to halt and report on runs that hit the cap.</p>
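<p>One way to sketch the step budget and circuit breaker described above: derive the cap from the 95th percentile of observed step counts, then halt any run that reaches it. The <code class="language-plaintext highlighter-rouge">agent_step</code> callable and the sample history are hypothetical stand-ins for a real agent loop and real telemetry.</p>

```python
def step_cap(successful_step_counts, percentile=95):
    """Nearest-rank percentile of step counts from successful runs."""
    ordered = sorted(successful_step_counts)
    # Integer ceiling of percentile/100 * n, avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


class StepBudgetExceeded(RuntimeError):
    """Raised when an agent run hits its step cap."""


def run_with_budget(agent_step, max_steps):
    """Call agent_step() until it returns True (done) or the cap is hit.

    agent_step is a hypothetical callable standing in for one tool call
    or LLM invocation. Returns the number of steps used.
    """
    for step in range(1, max_steps + 1):
        if agent_step():
            return step
    raise StepBudgetExceeded(f"halted after {max_steps} steps")


# Illustrative step counts from 20 successful runs, including one outlier.
history = [3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 12, 40]
cap = step_cap(history)
print(cap)  # 12
```

<p>Raising an exception rather than returning silently forces the caller to log and report the halted run, which is the reporting half of the circuit breaker.</p>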

<p><strong>Layer 4: Billing Model Routing</strong>
Not all workloads should use the same billing model. Per-token pricing is optimal for low-token-density text workloads. Per-minute pricing (Gemini Live API) is optimal for real-time audio and high-token-density multimodal workloads. Build routing logic that assigns each workload class to its economically optimal billing model, and re-evaluate the routing quarterly as pricing evolves.</p>

<p><strong>Layer 5: Prompt Cache Architecture</strong>
Every production prompt must be audited for its static and dynamic components. The static portion — system prompts, knowledge bases, fixed instructions — should be cached. Target a cache hit rate of 80% or higher for high-volume workloads. Treat the cache hit rate as a production SLO, alongside latency and error rate.</p>
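<p>Treating the hit rate as an SLO implies a concrete check. The sketch below assumes the hit rate is measured as cached input tokens over total input tokens; measuring by request count instead would be an equally reasonable choice. The token totals are illustrative.</p>

```python
def cache_hit_rate(cached_input_tokens, total_input_tokens):
    """Share of input tokens served from the prompt cache."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


def meets_cache_slo(cached_input_tokens, total_input_tokens, target=0.80):
    """True when the observed hit rate is at or above the SLO target."""
    return cache_hit_rate(cached_input_tokens, total_input_tokens) >= target


print(meets_cache_slo(8_500_000, 10_000_000))  # True
print(meets_cache_slo(6_000_000, 10_000_000))  # False
```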

<p><strong>Layer 6: Infrastructure and Make/Buy Policy</strong>
Establish a volume threshold policy for each workload type: below the threshold, API inference is the default; above the threshold, local inference on SLMs is evaluated. For most enterprise workloads, this threshold is in the range of 3–6 million documents or API calls per month at Haiku-tier pricing. Review the policy annually as hardware costs, model quality, and API pricing evolve.</p>

<p><strong>Layer 7: Upstream Risk Management</strong>
Monitor provider infrastructure constraints — data centre capacity, power availability, hardware supply chains — as leading indicators of API price trajectories. Maintain multi-provider capability (not just multi-provider integration, but tested and production-ready fallback routes) for all critical AI workloads. Provider concentration risk is infrastructure risk, not just vendor risk.</p>

<h2 id="the-2026-maturity-model">The 2026 Maturity Model</h2>

<p>Organisations fall into five tiers of AI economic maturity:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Description</th>
      <th>Indicators</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Tier 0: Unaware</strong></td>
      <td>No cost instrumentation, no governance</td>
      <td>Budget surprises, post-hoc reconciliation</td>
    </tr>
    <tr>
      <td><strong>Tier 1: Measured</strong></td>
      <td>API costs attributed and visible</td>
      <td>Monthly spend dashboards, per-team visibility</td>
    </tr>
    <tr>
      <td><strong>Tier 2: Managed</strong></td>
      <td>Costs instrumented and governed</td>
      <td>Caps enforced, caching implemented, model tiers assigned</td>
    </tr>
    <tr>
      <td><strong>Tier 3: Optimised</strong></td>
      <td>Continuous cost/quality optimisation</td>
      <td>Routing logic, SLM deployment, cache hit SLOs</td>
    </tr>
    <tr>
      <td><strong>Tier 4: Governed</strong></td>
      <td>Economic governance as org capability</td>
      <td>AI FinOps function, policy framework, audit trail</td>
    </tr>
  </tbody>
</table>

<p>Most organisations that have been running AI in production for 12+ months are at Tier 1 or early Tier 2. Uber, as described earlier in this series, was a Tier 0 organisation that scaled to high spend without progressing through the tiers. The organisations that will define the AI economics of 2027 are moving from Tier 2 to Tier 3 now.</p>

<h2 id="the-priority-action-sequence">The Priority Action Sequence</h2>

<p>If you are a CTO reading this series and wondering where to start, this is the sequence that delivers the highest ROI per unit of engineering effort:</p>

<ol>
  <li>
    <p><strong>Instrument first.</strong> Deploy API gateway-level token metering with team and workload attribution. If you cannot see the spend, you cannot govern it. Time to value: 2–4 weeks.</p>
  </li>
  <li>
    <p><strong>Cache your system prompts.</strong> Enable prompt caching for every production workload with a stable system prompt longer than 1,000 tokens. This requires only a prompt restructuring and a cache header — no infrastructure change. Expected savings: 40–70% of input token costs for affected workloads. Time to value: 1 week.</p>
  </li>
  <li>
    <p><strong>Assign model tiers.</strong> Audit every production workload against the model you are using. For classification, extraction, and structured generation workloads on Opus or Sonnet, migrate to Haiku or Flash. Expected savings: 60–90% on those workloads with minimal quality impact. Time to value: 2–4 weeks per workload.</p>
  </li>
  <li>
    <p><strong>Budget your agents.</strong> For every agentic workflow, add a step counter and a circuit breaker. Cap at 95th-percentile empirical step counts. Time to value: 1–2 weeks per agent.</p>
  </li>
  <li>
    <p><strong>Evaluate local inference.</strong> For your top two or three high-volume workloads, run the break-even calculation from Post 7. If any are above the threshold, pilot a local SLM deployment. Time to value: 4–8 weeks for pilot.</p>
  </li>
  <li>
    <p><strong>Build the routing layer.</strong> Once tiers and models are assigned, invest in routing infrastructure that enforces the assignments programmatically. This is the foundation of Tier 3 maturity. Time to value: 4–8 weeks.</p>
  </li>
</ol>

<h2 id="the-cost-formula-for-the-governed-organisation">The Cost Formula for the Governed Organisation</h2>

<p>A fully governed AI stack does not pay the full token price for its workloads. It pays:</p>

\[C_{governed} = \sum_{w \in W} V_w \cdot \left( f_{cache}(w) \cdot P_{cache} + (1 - f_{cache}(w)) \cdot P_{tier}(w) \right) + C_{local}\]

<p>Where:</p>
<ul>
  <li>\(W\) is the set of all workloads</li>
  <li>\(V_w\) is the volume for workload \(w\)</li>
  <li>\(f_{cache}(w)\) is the cache hit rate for workload \(w\)</li>
  <li>\(P_{cache}\) is the cache read price</li>
  <li>\(P_{tier}(w)\) is the price for the assigned model tier for workload \(w\)</li>
  <li>\(C_{local}\) is the amortised TCO of local inference deployments</li>
</ul>
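<p>The formula translates directly into code. The sketch below is one possible reading of it, with every number in the example invented purely to show the mechanics; volumes and prices only need to share a unit (for instance, millions of tokens and dollars per million tokens).</p>

```python
from dataclasses import dataclass


@dataclass
class Workload:
    volume: float          # V_w, billable units per period
    cache_hit_rate: float  # f_cache(w), between 0 and 1
    tier_price: float      # P_tier(w), price per unit at the assigned tier


def governed_cost(workloads, cache_price, local_tco):
    """C_governed = sum_w V_w * (f * P_cache + (1 - f) * P_tier(w)) + C_local."""
    api_cost = sum(
        w.volume * (w.cache_hit_rate * cache_price
                    + (1 - w.cache_hit_rate) * w.tier_price)
        for w in workloads
    )
    return api_cost + local_tco


# Hypothetical inputs: one heavily cached workload, one uncached small-tier workload.
workloads = [
    Workload(volume=1_000, cache_hit_rate=0.8, tier_price=3.00),
    Workload(volume=500, cache_hit_rate=0.0, tier_price=0.25),
]
print(round(governed_cost(workloads, cache_price=0.30, local_tco=600.0), 2))
```

<p>Setting every cache hit rate to zero and <code class="language-plaintext highlighter-rouge">local_tco</code> to zero recovers the ungoverned baseline, which makes the function usable for before-and-after comparisons.</p>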

<p>In practice, a governed organisation pays 30–50% of what an equivalent ungoverned organisation pays for the same AI output quality. At the scale of a mid-to-large enterprise, this difference is measured in millions of dollars annually — and it is the difference between an AI budget that is defensible in a CFO review and one that produces the next Uber headline.</p>

<h2 id="a-final-word-on-2027">A Final Word on 2027</h2>

<p>The AI capability curve will continue upward. Models will become more capable, context windows will expand further, and agentic workflows will become more autonomous and more deeply embedded in engineering processes. Every one of those trends increases both the value of AI investment and the potential cost of ungoverned AI consumption.</p>

<p>The organisations that build economic governance now — when the costs are measurable and the frameworks are available — will enter 2027 with a structural advantage. They will be able to absorb new capabilities without budget crises because they have the instrumentation to understand what they are buying and the architecture to route spend appropriately.</p>

<p>The organisations that do not will face the same reckoning Uber faced, at higher absolute costs, with more entrenched dependencies and less flexibility to change course.</p>

<hr />

<p><strong>Don’t let your 2027 budget evaporate.</strong> The guardrails Uber missed are not complex — they are the governance primitives this series has described. If your organisation is at Tier 0 or Tier 1 and you need to close the gap before your next planning cycle, <a href="https://goldmanmalka.com">contact me directly</a> to discuss a focused AI FinOps engagement: instrumentation, model tier audit, caching architecture, and a prioritised roadmap calibrated to your specific stack and spend profile.</p>

<hr />

<p><strong>This post concludes the 2026 AI Token Economy series.</strong> The full series:</p>

<ol>
  <li><a href="/uber-token-burn/">The Great Token Burn: How Uber Exhausted Its 2026 AI Budget by May</a></li>
  <li><a href="/physics-of-context/">The Context Tax: Quadratic Cost Scaling and the $6M Healthcare RAG Overrun</a></li>
  <li><a href="/agentic-loops/">The Infinite Spend Bug: Recursive Agent Loops and the Metered Future of Agentic AI</a></li>
  <li><a href="/per-minute-pricing/">Beyond the Token: Google’s Per-Minute Pricing and the Disruption of Real-Time AI Economics</a></li>
  <li><a href="/prompt-caching-kv/">KV Cache Optimization: Why Server-Side Prompt Caching Is the New S3 of AI Infrastructure</a></li>
  <li><a href="/infrastructure-bottleneck/">Power as the New Token: Gartner’s $1.37 Trillion Infrastructure Bet and the Physics of AI at Scale</a></li>
  <li><a href="/sovereign-ai-slms/">The Local Inference ROI: 4-Bit Quantization, SLMs, and the Case for Bypassing the API</a></li>
  <li><strong>From Tokenmaxxing to Economic Governance: The 2026 AI Roadmap for CTOs Who Want a 2027 Budget</strong> <em>(this post)</em></li>
</ol>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="AI governance" /><category term="ROI" /><category term="AI roadmap" /><category term="economic governance" /><category term="budget" /><category term="CTO strategy" /><summary type="html"><![CDATA[Over the past seven posts, this series has examined the 2026 AI token economy from seven distinct angles: the Uber budget collapse, the physics of context scaling, recursive agent loops, per-minute pricing disruption, KV cache optimisation, infrastructure power constraints, and the local inference break-even. Each post was a close-up on a specific failure mode or opportunity. This final post is the wide-angle view — a synthesis of the full series into a governance framework that translates individual insights into organisational practice. The thesis is simple: the era of “tokenmaxxing” — deploying AI at maximum capability without cost discipline — is over. 
The organisations that will thrive in 2027 are those that implement economic governance over their AI stacks before their next budget cycle, not after.]]></summary></entry><entry><title type="html">The Local Inference ROI: 4-Bit Quantization, SLMs, and the Case for Bypassing the API</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/sovereign-ai-slms/" rel="alternate" type="text/html" title="The Local Inference ROI: 4-Bit Quantization, SLMs, and the Case for Bypassing the API" /><published>2026-05-25T07:00:00+00:00</published><updated>2026-05-25T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/sovereign-ai-slms</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/sovereign-ai-slms/"><![CDATA[<p>Every post in this series has examined a different dimension of API cost: token pricing, context scaling, agentic multiplication, per-minute billing, cache optimisation, and infrastructure constraints. The implicit assumption throughout has been that API inference is the only option — that your choice is which provider, which model, and which optimisation technique to apply within the API billing model. In 2026, that assumption requires re-examination. <strong>Small Language Models (SLMs) combined with 4-bit quantisation have reached a capability and cost profile that makes local inference economically rational for a well-defined and growing class of enterprise workloads.</strong> Understanding when to bypass the API entirely is now a first-order strategic decision, not a research-stage exploration.</p>

<!--more-->

<h2 id="the-slm-landscape-in-2026">The SLM Landscape in 2026</h2>

<p>The past 18 months have seen a qualitative shift in what small models can do. The following models define the current frontier of efficient inference:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Parameters</th>
      <th>Context Window</th>
      <th>Notable Strength</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Microsoft Phi-3 Mini</td>
      <td>3.8B</td>
      <td>128K</td>
      <td>Reasoning, code generation</td>
    </tr>
    <tr>
      <td>Microsoft Phi-3 Medium</td>
      <td>14B</td>
      <td>128K</td>
      <td>Broad task coverage</td>
    </tr>
    <tr>
      <td>Mistral 7B v0.3</td>
      <td>7B</td>
      <td>32K</td>
      <td>Instruction following, multilingual</td>
    </tr>
    <tr>
      <td>Mistral Nemo</td>
      <td>12B</td>
      <td>128K</td>
      <td>Long-context, function calling</td>
    </tr>
    <tr>
      <td>Meta Llama-3.1 8B</td>
      <td>8B</td>
      <td>128K</td>
      <td>General purpose, strong benchmarks</td>
    </tr>
    <tr>
      <td>Meta Llama-3.1 70B</td>
      <td>70B</td>
      <td>128K</td>
      <td>Near-frontier quality at smaller footprint</td>
    </tr>
    <tr>
      <td>Google Gemma 2 9B</td>
      <td>9B</td>
      <td>8K</td>
      <td>Efficient, strong for size</td>
    </tr>
    <tr>
      <td>Qwen-2.5 7B</td>
      <td>7B</td>
      <td>128K</td>
      <td>Code, maths, Chinese-English</td>
    </tr>
  </tbody>
</table>

<p>These are not toy models. Phi-3 Mini at 3.8 billion parameters scores within a few points of GPT-3.5-Turbo on the MMLU (Massive Multitask Language Understanding) benchmark. Llama-3.1 8B can approach GPT-4-level output on narrowly scoped structured reasoning tasks when deployed with careful prompt engineering. For well-scoped enterprise tasks (classification, extraction, summarisation, code review, structured data generation), models in the 7–14B range meet quality bars that would have required frontier models as recently as 2023.</p>

<h2 id="4-bit-quantisation-the-economics-of-precision-reduction">4-Bit Quantisation: The Economics of Precision Reduction</h2>

<p>An unquantised (FP16 or BF16) 7B-parameter model requires approximately <strong>14GB of GPU VRAM</strong> for the weights alone: 7 billion parameters at 2 bytes each. With headroom for the KV cache and activations, this demands a 24GB-class GPU, such as a single A10G or a consumer-grade RTX 4090.</p>

<p><strong>4-bit quantisation</strong> reduces each weight from a 16-bit floating-point number to a 4-bit integer representation, shrinking the memory footprint by roughly 75%. A 7B model in 4-bit (GGUF format via llama.cpp, or GPTQ/AWQ for GPU inference) fits in approximately <strong>4–5GB of memory</strong>, which is small enough for an 8GB consumer GPU, or for CPU RAM with acceptable throughput for moderate-volume workloads.</p>
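<p>The memory arithmetic can be sanity-checked in a few lines. This sketch counts weights only and ignores the KV cache, activations, and quantisation metadata; real 4-bit files tend to be somewhat larger than the raw figure because formats store scaling factors and keep some tensors at higher precision, which is consistent with the 4–5GB quoted above.</p>

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


fp16 = weight_memory_gb(7, 16)  # 14.0 GB
int4 = weight_memory_gb(7, 4)   # 3.5 GB before format overhead
print(fp16, int4, 1 - int4 / fp16)  # 14.0 3.5 0.75
```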

<p>The quality trade-off is measurable but, for most production tasks, acceptable:</p>

<table>
  <thead>
    <tr>
      <th>Task Category</th>
      <th>FP16 Quality</th>
      <th>4-bit Quality</th>
      <th>Degradation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Text classification</td>
      <td>Baseline</td>
      <td>~97% of baseline</td>
      <td>~3%</td>
    </tr>
    <tr>
      <td>Named entity extraction</td>
      <td>Baseline</td>
      <td>~95% of baseline</td>
      <td>~5%</td>
    </tr>
    <tr>
      <td>Code generation (simple)</td>
      <td>Baseline</td>
      <td>~93% of baseline</td>
      <td>~7%</td>
    </tr>
    <tr>
      <td>Complex multi-step reasoning</td>
      <td>Baseline</td>
      <td>~85% of baseline</td>
      <td>~15%</td>
    </tr>
    <tr>
      <td>Creative generation</td>
      <td>Baseline</td>
      <td>~88% of baseline</td>
      <td>~12%</td>
    </tr>
  </tbody>
</table>

<p>For classification and extraction workloads — which represent the majority of enterprise AI deployments — 4-bit quality is operationally indistinguishable from full precision. The degradation concentrates in complex reasoning and creative tasks, which are also the workloads that most justify using a frontier model in the first place.</p>

<h2 id="the-total-cost-of-ownership-calculation">The Total Cost of Ownership Calculation</h2>

<p>The break-even analysis between local inference and API inference requires modelling the full TCO of a local deployment against the API spend it replaces. Consider a mid-sized enterprise with the following workload profile:</p>

<ul>
  <li><strong>Task:</strong> Document classification and structured data extraction</li>
  <li><strong>Volume:</strong> 500,000 documents per month</li>
  <li><strong>Average document length:</strong> 800 tokens</li>
  <li><strong>Average extraction output:</strong> 200 tokens</li>
  <li><strong>API cost (Claude Haiku — the appropriate tier for this task):</strong></li>
</ul>

<p>\(C_{API/month} = 500{,}000 \times \left( \frac{800}{1{,}000{,}000} \times 0.25 + \frac{200}{1{,}000{,}000} \times 1.25 \right)\)
\(= 500{,}000 \times (0.0002 + 0.00025) = 500{,}000 \times 0.00045 = \$225/\text{month}\)</p>

<p>At this volume, the API cost is modest — $225/month, or $2,700 annually. A local deployment would not be justified on cost grounds alone.</p>

<p>Now scale to a high-volume deployment:</p>

<ul>
  <li><strong>Volume:</strong> 10,000,000 documents per month (document-heavy enterprise: legal, financial services, healthcare)</li>
  <li><strong>API cost at Haiku pricing:</strong> $4,500/month = <strong>$54,000/year</strong></li>
</ul>
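<p>Both API figures follow from the same per-document calculation. The sketch below reproduces them using the token counts and Haiku prices stated above; the function name is just a label for this example.</p>

```python
def monthly_api_cost(docs_per_month, input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Monthly API spend given per-document token counts and per-million prices."""
    per_doc = (input_tokens * input_price_per_m
               + output_tokens * output_price_per_m) / 1_000_000
    return docs_per_month * per_doc


for volume in (500_000, 10_000_000):
    cost = monthly_api_cost(volume, 800, 200, 0.25, 1.25)
    print(f"{volume:,} docs/month: ${cost:,.2f}/month, ${cost * 12:,.2f}/year")
```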

<p><strong>Local deployment TCO (4-bit Mistral 7B on single A100 80GB server):</strong></p>

<table>
  <thead>
    <tr>
      <th>Cost Category</th>
      <th>One-Time</th>
      <th>Monthly</th>
      <th>Annual</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A100 80GB server (used/leased)</td>
      <td>$18,000</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td>Co-location power + rack</td>
      <td>—</td>
      <td>$400</td>
      <td>$4,800</td>
    </tr>
    <tr>
      <td>Engineering setup (one-time)</td>
      <td>$8,000</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td>Maintenance + monitoring</td>
      <td>—</td>
      <td>$200</td>
      <td>$2,400</td>
    </tr>
    <tr>
      <td><strong>Total Year 1</strong></td>
      <td> </td>
      <td> </td>
      <td><strong>$33,200</strong></td>
    </tr>
    <tr>
      <td><strong>Total Year 2+</strong></td>
      <td> </td>
      <td> </td>
      <td><strong>$7,200</strong></td>
    </tr>
  </tbody>
</table>

<p>At 10M documents/month, the API spend is $54,000/year. Local deployment costs $33,200 in year one and $7,200 in year two. <strong>Break-even occurs during year one; year two savings are $46,800.</strong></p>
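<p>To see when within year one the lines cross, a month-by-month comparison of cumulative spend is enough. This sketch uses the figures from the table ($18,000 server plus $8,000 setup up front, $400 plus $200 per month to run) against $4,500 per month of API spend, and assumes costs accrue evenly each month.</p>

```python
def breakeven_month(api_monthly, upfront, local_monthly, horizon=36):
    """First month in which cumulative API spend reaches cumulative local spend."""
    for month in range(1, horizon + 1):
        if api_monthly * month >= upfront + local_monthly * month:
            return month
    return None  # no break-even within the horizon


month = breakeven_month(api_monthly=4_500, upfront=26_000, local_monthly=600)
print(month)                # 7
print((4_500 - 600) * 12)   # 46800, the year-two saving
```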

<p>The formula for the volume threshold at which local inference becomes economically superior:</p>

\[V_{break-even} = \frac{C_{capex} + C_{opex\_year1}}{C_{API\_per\_doc}}\]

<p>Where \(C_{API\_per\_doc} = 0.00045\) per document in this example, and \(C_{capex} + C_{opex\_year1} = \$18,000 + \$7,200 = \$25,200\) (excluding engineering):</p>

\[V_{break-even} = \frac{25{,}200}{0.00045} = 56{,}000{,}000 \text{ documents/year} \approx 4{,}670{,}000 \text{ documents/month}\]

<p>Below roughly 4.7 million documents per month at Haiku pricing, the API wins on economics. Above that threshold, local inference wins in year one and dominates from year two onward.</p>

<h2 id="deployment-architecture-llamacpp-and-ollama">Deployment Architecture: llama.cpp and Ollama</h2>

<p>The tooling for local SLM deployment has matured significantly. <strong>llama.cpp</strong> provides CPU and GPU inference for GGUF-quantised models with minimal dependencies: a single server binary, a model file, and an OpenAI-compatible API endpoint that can stand in for cloud API calls with no changes to application logic.</p>

<p><strong>Ollama</strong> wraps llama.cpp with a model registry, CLI management, and an HTTP API, reducing deployment to a three-command sequence:</p>

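<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start the server in the background (skip if Ollama already runs as a service)</span>
ollama serve &amp;

<span class="c"># Pull the model (the CLI talks to the running server)</span>
ollama pull mistral:7b-instruct-q4_K_M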

<span class="c"># Call it via OpenAI-compatible endpoint</span>
curl http://localhost:11434/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model": "mistral:7b-instruct-q4_K_M", "messages": [{"role": "user", "content": "Classify this document..."}]}'</span>
</code></pre></div></div>

<p>The OpenAI-compatible endpoint means that any application using the standard OpenAI SDK can switch to local inference by changing the <code class="language-plaintext highlighter-rouge">base_url</code> parameter — no business logic changes required. This is the migration path that makes the break-even calculation actionable rather than theoretical.</p>
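<p>The same idea can be shown without any SDK. The sketch below uses only the Python standard library and assumes an Ollama server is listening on <code class="language-plaintext highlighter-rouge">localhost:11434</code> as in the commands above; the helper names are hypothetical, and the placeholder API key is there only because OpenAI-style clients usually send one.</p>

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible route


def build_chat_request(base_url, model, prompt, api_key="unused-locally"):
    """Build an OpenAI-style chat completion request for any compatible base URL."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def classify(prompt, base_url=LOCAL_BASE_URL, model="mistral:7b-instruct-q4_K_M"):
    """Send the request; requires a server listening at base_url."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

<p>Pointing <code class="language-plaintext highlighter-rouge">base_url</code> at a cloud provider’s OpenAI-compatible endpoint, with a real key and model name, leaves <code class="language-plaintext highlighter-rouge">classify</code> itself unchanged.</p>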

<h2 id="hybrid-routing-the-strategic-playbook">Hybrid Routing: The Strategic Playbook</h2>

<p>The economically optimal architecture for most enterprises is not a binary choice between API and local inference — it is <strong>intelligent routing</strong> based on task complexity and volume:</p>

<ol>
  <li><strong>High-volume, bounded-complexity tasks</strong> (classification, extraction, structured generation): Local SLM on owned or leased hardware</li>
  <li><strong>Medium-volume, moderate-complexity tasks</strong> (summarisation, code review, QA): API with Haiku or Flash tier models, with aggressive prompt caching</li>
  <li><strong>Low-volume, high-complexity tasks</strong> (strategic analysis, complex reasoning, novel problem-solving): API with Sonnet or Pro tier models, no caching overhead justified</li>
</ol>

<p>This tiered architecture combines the cost efficiency of local inference for the high-volume base with the capability ceiling of frontier models for the tasks that genuinely require them. In practice, for most enterprise workloads, 60–70% of API spend concentrates in the high-volume bounded-complexity tier — exactly the tier where local SLMs are most competitive.</p>
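<p>A routing policy along these lines can start as a very small function. In this sketch the complexity labels, the tier names, and the 5-million-per-month default threshold are illustrative assumptions; a production router would take them from measured workload data.</p>

```python
def route(monthly_volume, complexity, local_threshold=5_000_000):
    """Pick an inference tier from volume and task complexity.

    complexity is one of "bounded", "moderate", "high". The threshold and
    the returned tier labels are placeholders for this example.
    """
    if complexity == "high":
        return "api-frontier"
    if complexity == "bounded" and monthly_volume >= local_threshold:
        return "local-slm"
    return "api-small-cached"


print(route(10_000_000, "bounded"))   # local-slm
print(route(200_000, "moderate"))     # api-small-cached
print(route(5_000, "high"))           # api-frontier
```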

<p><strong>Data sovereignty</strong> is an additional non-financial justification for local deployment that applies in regulated industries. Prompts, context, and outputs that never leave your infrastructure cannot appear in a provider’s training data, cannot be subject to a provider’s data retention policy change, and cannot be disclosed in a provider security incident. For healthcare, legal, and financial services workloads, this consideration may justify local inference even when the pure cost calculus is marginal.</p>

<hr />

<p><strong>Next in the series:</strong> <a href="/2026-ai-roadmap/">From Tokenmaxxing to Economic Governance: The 2026 AI Roadmap for CTOs Who Want a 2027 Budget</a> — synthesising the full series into a governance framework, a prioritised action list, and the consulting CTA for teams that want the guardrails Uber missed.</p>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="SLM" /><category term="quantization" /><category term="local inference" /><category term="Phi-3" /><category term="Mistral" /><category term="Llama" /><category term="ROI" /><category term="on-premises AI" /><summary type="html"><![CDATA[Every post in this series has examined a different dimension of API cost: token pricing, context scaling, agentic multiplication, per-minute billing, cache optimisation, and infrastructure constraints. The implicit assumption throughout has been that API inference is the only option — that your choice is which provider, which model, and which optimisation technique to apply within the API billing model. In 2026, that assumption requires re-examination. Small Language Models (SLMs) combined with 4-bit quantisation have reached a capability and cost profile that makes local inference economically rational for a well-defined and growing class of enterprise workloads. 
Understanding when to bypass the API entirely is now a first-order strategic decision, not a research-stage exploration.]]></summary></entry><entry><title type="html">Power as the New Token: Gartner’s $1.37 Trillion Infrastructure Bet and the Physics of AI at Scale</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/infrastructure-bottleneck/" rel="alternate" type="text/html" title="Power as the New Token: Gartner’s $1.37 Trillion Infrastructure Bet and the Physics of AI at Scale" /><published>2026-05-21T07:00:00+00:00</published><updated>2026-05-21T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/infrastructure-bottleneck</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/infrastructure-bottleneck/"><![CDATA[<p>Every discussion of AI cost in 2026 eventually arrives at the same upstream constraint: electricity. The token prices on every API pricing page, the per-minute rates, the per-seat subscriptions — they are all downstream of a physical fact that no software optimisation can dissolve. <strong>Training and running large language models requires power at a scale that is straining the capacity of data centres, national grids, and the global supply chains for the hardware that converts electricity into inference.</strong> Gartner’s forecast of $1.37 trillion in AI infrastructure spending by 2026 is not a number about software or services — it is primarily a number about construction, cooling, and electrical generation. Understanding this layer is essential for any CTO who wants to reason accurately about the medium-term trajectory of AI costs.</p>

<!--more-->

<h2 id="the-gartner-numbers-in-context">The Gartner Numbers in Context</h2>

<p>Gartner’s AI spending forecast projects global AI-related spend reaching $2.52 trillion by 2026, with approximately <strong>54% — roughly $1.37 trillion</strong> — allocated to infrastructure: data centre construction, GPU and TPU hardware procurement, networking, and the power and cooling systems required to operate them.</p>

<p>This is an extraordinary figure. For comparison, global cloud infrastructure spending (excluding AI-specific builds) was approximately $270 billion in 2023. At $1.37 trillion, the AI infrastructure build-out is roughly <strong>five times</strong> that 2023 baseline, compressed into a 24-month window.</p>

<p>The composition of that spend:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Estimated Share</th>
      <th>2026 Spend Estimate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPU / accelerator hardware</td>
      <td>38%</td>
      <td>$521B</td>
    </tr>
    <tr>
      <td>Data centre construction</td>
      <td>22%</td>
      <td>$301B</td>
    </tr>
    <tr>
      <td>Power &amp; cooling systems</td>
      <td>18%</td>
      <td>$247B</td>
    </tr>
    <tr>
      <td>Networking &amp; interconnect</td>
      <td>12%</td>
      <td>$164B</td>
    </tr>
    <tr>
      <td>Storage infrastructure</td>
      <td>10%</td>
      <td>$137B</td>
    </tr>
  </tbody>
</table>

<p>Hardware dominates, but the power and cooling line — $247 billion — is the constraint that does not scale with capital. You can order more H100s. You cannot order more megawatts from a grid that is already at capacity.</p>
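<p>The dollar figures in the table follow directly from the shares. A minimal sketch, assuming the shares apply to the $1.37 trillion infrastructure total and that rounding is to the nearest billion (the function and constant names are illustrative, not from any Gartner material):</p>

```python
# Shares are the estimates from the table above; the total is the article's
# $1.37 trillion infrastructure figure, expressed in billions of dollars.
INFRA_TOTAL_BUSD = 1370.0

SHARES = {
    "GPU / accelerator hardware": 0.38,
    "Data centre construction": 0.22,
    "Power and cooling systems": 0.18,
    "Networking and interconnect": 0.12,
    "Storage infrastructure": 0.10,
}


def spend_by_category(total_busd: float, shares: dict[str, float]) -> dict[str, int]:
    """Return each category's spend in whole billions of dollars."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 100%")
    return {name: round(total_busd * share) for name, share in shares.items()}


if __name__ == "__main__":
    for name, busd in spend_by_category(INFRA_TOTAL_BUSD, SHARES).items():
        print(f"{name}: ${busd}B")
```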

<h2 id="the-physics-of-ai-power-demand">The Physics of AI Power Demand</h2>

<p>A single NVIDIA H100 GPU, the dominant training accelerator in 2025–2026, has a thermal design power (TDP) of <strong>700 watts</strong>. A standard AI training cluster uses 8 H100s per server node. A modest training run for a frontier model might use 1,024 such nodes:</p>

\[P_{cluster} = 1{,}024 \text{ nodes} \times 8 \text{ GPUs/node} \times 700\text{W} = 5.73\text{ MW}\]

<p>That is 5.73 megawatts for the GPUs alone — before accounting for the additional 30–40% overhead of CPU, networking, storage, and crucially, <strong>cooling</strong>. A data centre’s Power Usage Effectiveness (PUE) ratio — the ratio of total facility power to IT equipment power — typically ranges from 1.2 to 1.5 for modern hyperscale facilities. Treating the GPU draw as the entire IT load (a simplification that makes the result a lower bound, since CPUs, networking, and storage also count as IT power), a PUE of 1.35 means the full facility draw for that cluster is:</p>

\[P_{facility} = 5.73\text{ MW} \times 1.35 = 7.74\text{ MW}\]

<p>A frontier model training run lasting 90 days at this cluster size consumes:</p>

\[E_{training} = 7.74\text{ MW} \times 90 \times 24\text{ h} = 16{,}718\text{ MWh}\]

<p>That is 16.7 gigawatt-hours for a single training run — equivalent to the annual electricity consumption of approximately 1,500 average US households.</p>
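<p>The chain of estimates above can be reproduced in a few lines. This is a sketch under the same assumptions as the text (GPU draw treated as the whole IT load, constant full-power operation for 90 days). The household consumption constant is an assumption added here for the comparison, not a figure from the article. Because the script carries unrounded intermediates, its energy total lands a few MWh above the 16,718 MWh obtained from the rounded 7.74 MW.</p>

```python
GPU_TDP_W = 700        # H100 thermal design power in watts, from the text
GPUS_PER_NODE = 8
NODES = 1024
PUE = 1.35
RUN_DAYS = 90
# Assumption (not from the article): average US household electricity use.
HOUSEHOLD_KWH_PER_YEAR = 10_800


def gpu_power_mw(nodes: int, gpus_per_node: int, tdp_w: float) -> float:
    """GPU-only power draw in megawatts."""
    return nodes * gpus_per_node * tdp_w / 1_000_000


def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility draw: IT power scaled by the PUE ratio."""
    return it_power_mw * pue


def training_energy_mwh(facility_mw: float, days: int) -> float:
    """Energy for a run at constant facility power, in megawatt-hours."""
    return facility_mw * days * 24


gpu_mw = gpu_power_mw(NODES, GPUS_PER_NODE, GPU_TDP_W)
facility_mw = facility_power_mw(gpu_mw, PUE)
energy_mwh = training_energy_mwh(facility_mw, RUN_DAYS)
households = energy_mwh * 1_000 / HOUSEHOLD_KWH_PER_YEAR

print(f"GPU draw:       {gpu_mw:.2f} MW")
print(f"Facility draw:  {facility_mw:.2f} MW")
print(f"Training run:   {energy_mwh:,.0f} MWh")
print(f"Household-years of electricity: {households:,.0f}")
```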

<p>Inference at scale is a separate problem. Unlike training, which occurs once per model version, <strong>inference runs continuously</strong>, serving every user query, every API call, every agentic loop in production. The world’s frontier AI providers are now operating thousands of nodes in continuous inference mode, 24 hours a day, 365 days a year. The inference power demand is arguably more consequential than training, because it does not stop.</p>

<h2 id="the-grid-constraint-is-real-and-binding">The Grid Constraint Is Real and Binding</h2>

<p>The data centre industry’s power demand has grown to the point where grid capacity is the primary constraint on hyperscaler expansion in several major markets. The Northern Virginia data centre corridor — historically the world’s largest concentration of data centre capacity — is facing utility interconnection queues measured in <strong>years</strong>, not months. Power delivery for new facilities signed in 2024 is not expected until 2027 or later in several jurisdictions.</p>

<p>Microsoft, Google, Amazon, and Meta have collectively committed to construction projects representing over 40 gigawatts of new data centre capacity globally through 2030. The announced capacity significantly exceeds current grid availability in the target markets, driving investment into:</p>

<ul>
  <li><strong>On-site natural gas generation</strong> — peaking plants co-located with data centres to bypass grid interconnection delays</li>
  <li><strong>Nuclear power agreements</strong> — Microsoft’s much-publicised power purchase agreement with Constellation Energy, under which a Three Mile Island reactor unit is to be restarted specifically for AI data centre supply</li>
  <li><strong>Long-term renewable PPAs</strong> — power purchase agreements for solar and wind that lock in capacity years in advance</li>
  <li><strong>Advanced cooling technologies</strong> — liquid cooling, immersion cooling, and direct-to-chip solutions that reduce cooling overhead and improve PUE, extending the effective capacity of existing power connections</li>
</ul>

<p>Each of these strategies has a cost premium relative to standard grid electricity. That premium is embedded in the price you pay per token.</p>

<h2 id="what-this-means-for-api-pricing-trajectory">What This Means for API Pricing Trajectory</h2>

<p>The relationship between infrastructure investment and API pricing is not immediate — providers absorb infrastructure costs over multi-year depreciation schedules and use scale economics to suppress per-unit costs. But the directional pressure is clear.</p>

<p>As grid constraints extend build timelines, as hardware procurement costs remain elevated due to accelerator supply chain concentration (TSMC’s advanced node capacity is the ultimate upstream constraint), and as cooling overhead increases with thermal density, the <strong>floor price for inference</strong> — the minimum cost at which a provider can run a frontier model without operating at a loss — rises.</p>

<p>The providers with the most efficient infrastructure (lowest PUE, best hardware utilisation, most aggressive custom silicon investment) will hold the pricing advantage. This is why Google’s custom TPU v5 fleet and its investment in 24/7 carbon-free energy procurement are not just sustainability initiatives — they are <strong>cost moats</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>Primary Accelerator</th>
      <th>Custom Silicon</th>
      <th>Noted Power Strategy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Google</td>
      <td>TPU v5 / v6</td>
      <td>Yes (in-house)</td>
      <td>24/7 CFE matching, on-site generation</td>
    </tr>
    <tr>
      <td>Microsoft / OpenAI</td>
      <td>NVIDIA H100/H200 + Maia</td>
      <td>Partial (Maia)</td>
      <td>Nuclear PPA (TMI restart)</td>
    </tr>
    <tr>
      <td>Amazon / AWS</td>
      <td>Trainium 2 / Inferentia</td>
      <td>Yes (in-house)</td>
      <td>Renewable PPAs, on-site generation</td>
    </tr>
    <tr>
      <td>Anthropic</td>
      <td>NVIDIA H100/H200</td>
      <td>No</td>
      <td>AWS-hosted, inherits AWS power strategy</td>
    </tr>
    <tr>
      <td>Meta</td>
      <td>NVIDIA H100 + MTIA</td>
      <td>Partial (MTIA)</td>
      <td>On-site solar, long-term wind PPAs</td>
    </tr>
  </tbody>
</table>

<h2 id="the-strategic-implication-internalise-the-infrastructure-risk">The Strategic Implication: Internalise the Infrastructure Risk</h2>

<p>For engineering leaders and CTOs, the infrastructure bottleneck has two strategic implications that extend beyond watching API prices.</p>

<p><strong>First, provider concentration risk is amplified.</strong> If your entire AI stack runs through a single API provider, you are exposed not just to that provider’s pricing decisions but to their infrastructure constraints. A provider that cannot expand data centre capacity in your required region may respond with higher prices, lower rate limits, or degraded performance during peak demand. Diversification across providers is not just a pricing hedge — it is an infrastructure risk hedge.</p>

<p><strong>Second, the economics of on-premises inference improve relative to API.</strong> When API prices embed a premium for constrained infrastructure, the total cost of ownership for dedicated inference hardware — leased or owned — becomes more competitive. A team running 10 million daily inferences on a dedicated A100 cluster in a co-location facility with reliable power access may, under certain workload profiles, achieve lower per-inference costs than the equivalent API spend. Post 7 in this series examines this calculation in detail through the lens of Small Language Models and quantised inference.</p>

<p>The organisations that model infrastructure risk as a first-class budget variable — not just today’s API price, but the trajectory of that price given the constraints upstream — will be better positioned to make rational make-vs-buy decisions for AI infrastructure over the next 24 months.</p>

<hr />

<p><strong>Next in the series:</strong> <a href="/sovereign-ai-slms/">The Local Inference ROI: 4-Bit Quantization, SLMs, and the Case for Bypassing the API</a> — the real numbers behind running Phi-3, Mistral 7B, and Llama-3 on your own hardware, and when the economics of local inference decisively outperform the API.</p>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="AI infrastructure" /><category term="data centres" /><category term="power grid" /><category term="Gartner" /><category term="ROI" /><category term="GPU" /><category term="CapEx" /><summary type="html"><![CDATA[Every discussion of AI cost in 2026 eventually arrives at the same upstream constraint: electricity. The token prices on every API pricing page, the per-minute rates, the per-seat subscriptions — they are all downstream of a physical fact that no software optimisation can dissolve. Training and running large language models requires power at a scale that is straining the capacity of data centres, national grids, and the global supply chains for the hardware that converts electricity into inference. Gartner’s forecast of $1.37 trillion in AI infrastructure spending by 2026 is not a number about software or services — it is primarily a number about construction, cooling, and electrical generation. 
Understanding this layer is essential for any CTO who wants to reason accurately about the medium-term trajectory of AI costs.]]></summary></entry><entry><title type="html">KV Cache Optimization: Why Server-Side Prompt Caching Is the New S3 of AI Infrastructure</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/prompt-caching-kv/" rel="alternate" type="text/html" title="KV Cache Optimization: Why Server-Side Prompt Caching Is the New S3 of AI Infrastructure" /><published>2026-05-18T07:00:00+00:00</published><updated>2026-05-18T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/prompt-caching-kv</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/prompt-caching-kv/"><![CDATA[<p>In the early years of cloud computing, the insight that transformed infrastructure economics was simple: storing data once and serving it many times from a distributed object store was orders of magnitude cheaper than recomputing or re-fetching it on every request. S3 became the canonical implementation of that insight. In 2026, the equivalent insight in AI inference is <strong>server-side KV cache management</strong> — and the organisations that have operationalised it are reporting 60–90% reductions in input token costs for workloads with stable, repeating context. This is not a niche optimisation. For any production AI system with a consistent system prompt, a shared knowledge base, or a high-volume API, prompt caching is the highest-ROI infrastructure investment available in the current AI cost landscape.</p>

<!--more-->

<h2 id="what-the-kv-cache-actually-is">What the KV Cache Actually Is</h2>

<p>When a transformer processes an input sequence, it computes key-value (KV) pairs for each token in the attention layers. These KV pairs are the intermediate representation that enables the model to contextualise each token against all preceding tokens. Computing them is the expensive part of processing input — it is the work that makes long-context models computationally intensive and expensive to run.</p>

<p><strong>Server-side prompt caching</strong> stores these pre-computed KV pairs on the inference provider’s servers, keyed to a specific prefix of your input prompt. When a subsequent request begins with that same prefix, the provider skips the KV computation for the cached portion and begins processing only from the cache boundary forward.</p>

<p>The cost structure shifts accordingly:</p>

\[C_{cached} = T_{cached} \cdot P_{cache\_read} + T_{uncached} \cdot P_{in} + T_{out} \cdot P_{out}\]

<p>Where \(P_{cache\_read}\) is the discounted rate for reading from cache, \(T_{cached}\) is the number of input tokens that hit the cache, and \(T_{uncached}\) is the number of input tokens billed at the standard input rate \(P_{in}\). For Anthropic’s implementation:</p>

<table>
  <thead>
    <tr>
      <th>Token Type</th>
      <th>Price (per 1M tokens)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Standard input</td>
      <td>$3.00</td>
    </tr>
    <tr>
      <td>Cache write (first call)</td>
      <td>$3.75</td>
    </tr>
    <tr>
      <td>Cache read (subsequent calls)</td>
      <td>$0.30</td>
    </tr>
    <tr>
      <td>Output</td>
      <td>$15.00</td>
    </tr>
  </tbody>
</table>

<p>The cache read price is <strong>90% cheaper</strong> than the standard input price. On the first call, you pay a 25% premium to write to cache. On every subsequent call that hits the cache, you pay 10 cents on the dollar. The break-even point is the second call: a single cache read saves $2.70 per million tokens against the standard rate, which more than covers the $0.75 write premium, so any prefix that is read at least once after being written is already generating savings.</p>
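<p>The break-even arithmetic can be checked with a short script. This is a sketch that prices a one-million-token prefix at the rates in the table and assumes every call after the first hits the cache; the function names are illustrative.</p>

```python
P_IN = 3.00           # standard input, $ per 1M tokens (table above)
P_CACHE_WRITE = 3.75  # first call writes the prefix to cache
P_CACHE_READ = 0.30   # later calls read the prefix from cache


def cost_uncached(calls: int) -> float:
    """Cost of sending a 1M-token prefix `calls` times with no caching."""
    return calls * P_IN


def cost_cached(calls: int) -> float:
    """Cost with caching: one write, then a read on every later call."""
    if calls == 0:
        return 0.0
    return P_CACHE_WRITE + (calls - 1) * P_CACHE_READ


def break_even_calls(max_calls: int = 1000) -> int:
    """Smallest number of calls at which caching is no more expensive."""
    for n in range(1, max_calls + 1):
        if cost_uncached(n) >= cost_cached(n):
            return n
    raise ValueError("caching never breaks even within max_calls")


if __name__ == "__main__":
    n = break_even_calls()
    print(f"break-even at call {n}: "
          f"cached ${cost_cached(n):.2f} vs uncached ${cost_uncached(n):.2f}")
```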

<h2 id="the-new-s3-analogy-and-why-it-holds">The “New S3” Analogy and Why It Holds</h2>

<p>S3 changed infrastructure economics by separating the cost of storing data from the cost of generating it. You generate data once (expensively), store it cheaply, and serve it repeatedly at marginal cost. The pattern is so fundamental that it now underlies every CDN, every database read replica, and every API response cache in modern infrastructure.</p>

<p>KV cache in AI inference follows the identical economic logic. You compute the KV representation of your prompt once (at the write price), store it in the provider’s inference cluster, and retrieve it repeatedly at marginal cost (the read price). The “content” being cached is not bytes in a file — it is the model’s internal representation of your text. But the economic structure is identical.</p>

<p>The analogy extends to management principles:</p>

<table>
  <thead>
    <tr>
      <th>S3 Concept</th>
      <th>KV Cache Equivalent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Object key</td>
      <td>Cache prefix hash</td>
    </tr>
    <tr>
      <td>TTL / expiration policy</td>
      <td>Cache invalidation on prompt change</td>
    </tr>
    <tr>
      <td>Cache-Control headers</td>
      <td>Prompt structure discipline</td>
    </tr>
    <tr>
      <td>CDN edge caching</td>
      <td>Provider-side KV storage</td>
    </tr>
    <tr>
      <td>Cache hit rate</td>
      <td>Cost reduction multiplier</td>
    </tr>
    <tr>
      <td>Cache miss cost</td>
      <td>Full input token cost</td>
    </tr>
    <tr>
      <td>Write-through vs write-back</td>
      <td>Single vs batched cache priming</td>
    </tr>
  </tbody>
</table>

<p>Just as S3 economics improve with higher cache hit rates, KV cache economics improve with higher prefix stability. The engineering challenge in both cases is the same: design your data structures (prompts or files) to maximise the portion that is stable and reusable.</p>

<h2 id="engineering-for-cache-hit-rate">Engineering for Cache Hit Rate</h2>

<p>The critical insight for maximising KV cache ROI is that <strong>cache hits are a function of prompt architecture</strong>, not just prompt length. A provider’s caching implementation stores and retrieves KV pairs based on a prefix match — the cached prefix must be bit-for-bit identical to the beginning of the new request. Any modification to the prefix, however small, invalidates the cache entry.</p>

<p>This means prompt engineering and cost engineering are the same activity. Practices that maximise cache hit rates:</p>

<p><strong>Place stable content at the beginning.</strong> System prompts, role definitions, static knowledge bases, and fixed instructions should come first in your prompt structure. Dynamic content — user inputs, session-specific context, real-time data — should come last. The cache boundary falls immediately before the first dynamic element.</p>

<p><strong>Separate system prompts from user prompts architecturally.</strong> Many frameworks co-mingle system instructions with per-request context. Restructuring to maintain a clean boundary between the static system layer and the dynamic user layer is the single most impactful structural change for cache performance.</p>

<p><strong>Avoid timestamp and UUID injection in system prompts.</strong> A common anti-pattern is including a current timestamp or session ID in the system prompt for logging purposes. This guarantees a cache miss on every request because the prefix changes every second. Move dynamic identifiers to user-turn messages or metadata fields instead.</p>

<p><strong>Version and manage your system prompts explicitly.</strong> Treat system prompts as code. Store them in version control. Use feature flags to roll them out. Avoid ad-hoc modifications that create new cache entries and orphan old ones, consuming cache storage without delivering hits.</p>
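<p>The timestamp anti-pattern is easy to demonstrate. The sketch below uses a SHA-256 fingerprint as a stand-in for a prefix match; how a given provider actually keys its cache is not described here, so the hashing is an assumption chosen only to show that one changed byte produces a different prefix. The prompt text and function names are hypothetical.</p>

```python
import hashlib

STABLE_SYSTEM_PROMPT = (
    "You are a support assistant for ExampleCo. "
    "Follow the escalation policy."
)


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix bytes; any byte change yields a new fingerprint."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prompt_bad(user_query: str, timestamp: str) -> tuple[str, str]:
    """Anti-pattern: the timestamp sits inside the would-be cached prefix."""
    prefix = f"[{timestamp}] {STABLE_SYSTEM_PROMPT}"
    return prefix, user_query


def build_prompt_good(user_query: str, timestamp: str) -> tuple[str, str]:
    """Stable prefix first; the timestamp travels with the dynamic tail."""
    prefix = STABLE_SYSTEM_PROMPT
    dynamic_tail = f"[{timestamp}] {user_query}"
    return prefix, dynamic_tail


if __name__ == "__main__":
    for build in (build_prompt_bad, build_prompt_good):
        first, _ = build("Where is my order?", "2026-05-18T07:00:00Z")
        second, _ = build("Where is my order?", "2026-05-18T07:00:01Z")
        same = prefix_fingerprint(first) == prefix_fingerprint(second)
        print(f"{build.__name__}: prefix reusable across requests = {same}")
```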

<h2 id="quantifying-the-roi-at-scale">Quantifying the ROI at Scale</h2>

<p>Consider an enterprise deployment of a customer service AI with the following characteristics:</p>

<ul>
  <li>System prompt: 4,500 tokens (product knowledge, persona, escalation policy)</li>
  <li>Average user query: 150 tokens</li>
  <li>Average response: 400 tokens</li>
  <li>Daily query volume: 100,000 calls</li>
  <li>All calls share the identical system prompt (one cache write, then cache reads for as long as steady traffic keeps the entry from expiring)</li>
</ul>

<p><strong>Without caching:</strong>
\(C_{daily} = 100{,}000 \times \left( \frac{4{,}650}{1{,}000{,}000} \times 3.00 + \frac{400}{1{,}000{,}000} \times 15.00 \right)\)
\(= 100{,}000 \times (0.01395 + 0.006) = 100{,}000 \times 0.01995 = \$1{,}995/\text{day}\)</p>

<p><strong>With caching (4,500-token system prompt cached; 150-token user query uncached):</strong>
\(C_{daily} = 100{,}000 \times \left( \frac{4{,}500}{1{,}000{,}000} \times 0.30 + \frac{150}{1{,}000{,}000} \times 3.00 + \frac{400}{1{,}000{,}000} \times 15.00 \right)\)
\(= 100{,}000 \times (0.00135 + 0.00045 + 0.006) = 100{,}000 \times 0.00780 = \$780/\text{day}\)</p>

<p><strong>Daily savings: $1,215. Annual savings: approximately $443,000</strong> — for a single prompt caching configuration change, with zero change to model quality or user experience.</p>

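<p>The ROI is immediate and recurring. The cache write premium (paying $3.75/1M instead of $3.00/1M for the 4,500 cached tokens) is about $0.0034 per write, while each cache read saves about $0.012 against the standard input rate. The premium is therefore recovered by the first cache hit, which at 100,000 calls per day arrives on average within about a second of the write.</p>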
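<p>The daily cost comparison can be expressed as one function of the token split. This is a sketch using the deployment figures and prices listed above; it ignores the cost of the occasional cache write, which is negligible at this volume, and the function name is illustrative.</p>

```python
P_IN = 3.00          # $ per 1M standard input tokens
P_CACHE_READ = 0.30  # $ per 1M cached input tokens
P_OUT = 15.00        # $ per 1M output tokens


def daily_cost(calls: int, cached_tokens: int, uncached_tokens: int,
               output_tokens: int) -> float:
    """Daily spend in dollars for a fixed per-call token split."""
    per_call = (
        cached_tokens * P_CACHE_READ
        + uncached_tokens * P_IN
        + output_tokens * P_OUT
    ) / 1_000_000
    return calls * per_call


CALLS_PER_DAY = 100_000
without_cache = daily_cost(CALLS_PER_DAY, 0, 4_500 + 150, 400)
with_cache = daily_cost(CALLS_PER_DAY, 4_500, 150, 400)
daily_savings = without_cache - with_cache
annual_savings = daily_savings * 365

print(f"Without caching: ${without_cache:,.0f}/day")
print(f"With caching:    ${with_cache:,.0f}/day")
print(f"Savings:         ${daily_savings:,.0f}/day, ${annual_savings:,.0f}/year")
```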

<h2 id="multi-document-knowledge-bases-and-tiered-caching">Multi-Document Knowledge Bases and Tiered Caching</h2>

<p>The single-system-prompt case is the simplest illustration, but the pattern scales to more complex architectures. A RAG system with a tiered knowledge base can implement <strong>layered caching</strong>:</p>

<ul>
  <li><strong>Tier 1 (always cached):</strong> Core system prompt, static product catalogue, compliance policies — 10,000–50,000 tokens that appear in every call</li>
  <li><strong>Tier 2 (segment-cached):</strong> Department-specific knowledge bases, customer tier policies — cached per user segment, rotating based on session attributes</li>
  <li><strong>Tier 3 (uncached):</strong> Real-time retrieved documents, current conversation history, dynamic user data</li>
</ul>

<p>This architecture maximises cache hit rate for the most expensive stable tokens while preserving the flexibility to inject dynamic context at the tail of each prompt. The engineering investment is primarily in prompt structure discipline — the infrastructure change is configuration, not code.</p>
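<p>One way to express the three tiers is as an ordered request payload. The sketch below is modelled on the block-level <code>cache_control</code> marker in Anthropic’s Messages API; it only builds the dictionary and makes no network call. The model name is a placeholder and the tier texts and function name are hypothetical.</p>

```python
def build_tiered_request(tier1_core: str, tier2_segment: str,
                         tier3_dynamic: str, user_query: str) -> dict:
    """Assemble a request with the stable tiers first and marked cacheable.

    Tier 1 and Tier 2 each end with a cache breakpoint, so a request for a
    different segment can still reuse the Tier 1 prefix. Tier 3 goes in the
    user turn, after every cached block.
    """
    return {
        "model": "MODEL_NAME_PLACEHOLDER",  # placeholder, not a real model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": tier1_core,
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": tier2_segment,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {"role": "user", "content": f"{tier3_dynamic}\n\n{user_query}"},
        ],
    }


if __name__ == "__main__":
    request = build_tiered_request(
        tier1_core="Core persona, product catalogue, compliance policies.",
        tier2_segment="Enterprise-tier support policies.",
        tier3_dynamic="Retrieved document: order status for the current session.",
        user_query="Can I change my delivery address?",
    )
    cached_blocks = [b for b in request["system"] if "cache_control" in b]
    print(f"cacheable system blocks: {len(cached_blocks)}")
```

<p>Keeping Tier 2 as its own block, rather than concatenating it onto Tier 1, is what lets two user segments share the Tier 1 cache entry.</p>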

<h2 id="the-management-imperative">The Management Imperative</h2>

<p>Prompt caching is not an advanced optimisation for teams with sophisticated ML platforms. It is table-stakes cost management for any production AI deployment running more than a few hundred daily calls. The AI software development costs that continue to escalate in 2026 are disproportionately concentrated in organisations that have not yet adopted this practice.</p>

<p>The governance requirement is simple: audit every production prompt for its stable and dynamic components. Measure the token split. Calculate the savings from caching the stable prefix. Implement the structural change. The investment is measured in engineering hours; the return is measured in budget lines.</p>

<p>The teams that treat their KV cache hit rate as a production metric — alongside latency, error rate, and throughput — will find that AI infrastructure begins to behave like cloud infrastructure: optimisable, predictable, and defensible to finance.</p>

<hr />

<p><strong>Next in the series:</strong> <a href="/infrastructure-bottleneck/">Power as the New Token: Gartner’s $1.37 Trillion Infrastructure Bet and the Physics of AI at Scale</a> — why the constraint on AI growth in 2026 is not model capability or API pricing, but electricity and the data centre supply chain.</p>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="KV cache" /><category term="prompt caching" /><category term="Anthropic" /><category term="inference optimization" /><category term="ROI" /><category term="AI infrastructure" /><summary type="html"><![CDATA[In the early years of cloud computing, the insight that transformed infrastructure economics was simple: storing data once and serving it many times from a distributed object store was orders of magnitude cheaper than recomputing or re-fetching it on every request. S3 became the canonical implementation of that insight. In 2026, the equivalent insight in AI inference is server-side KV cache management — and the organisations that have operationalised it are reporting 60–90% reductions in input token costs for workloads with stable, repeating context. This is not a niche optimisation. 
For any production AI system with a consistent system prompt, a shared knowledge base, or a high-volume API, prompt caching is the highest-ROI infrastructure investment available in the current AI cost landscape.]]></summary></entry><entry><title type="html">Beyond the Token: Google’s Per-Minute Pricing and the Disruption of Real-Time AI Economics</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/per-minute-pricing/" rel="alternate" type="text/html" title="Beyond the Token: Google’s Per-Minute Pricing and the Disruption of Real-Time AI Economics" /><published>2026-05-14T07:00:00+00:00</published><updated>2026-05-14T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/per-minute-pricing</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/per-minute-pricing/"><![CDATA[<p>The token has been the unit of account for AI inference since the first public OpenAI APIs launched in 2020. Every pricing page, every cost model, every engineering estimate in the industry has been denominated in tokens per million. In 2026, Google disrupted that convention with the Gemini Live API, priced not at the token level but at <strong>$0.005 per minute of audio interaction</strong>. This is not a minor pricing variant — it is a structural challenge to the assumptions that underpin every real-time AI application budget. Understanding when per-minute pricing is economically superior to per-token pricing, and when it is not, is now a required competency for any engineering leader deploying AI at scale.</p>

<!--more-->

<h2 id="why-google-moved-to-per-minute-pricing">Why Google Moved to Per-Minute Pricing</h2>

<p>The Gemini Live API is designed for continuous, real-time, multimodal interactions: voice conversations, live video analysis, streaming audio transcription paired with language model response. In these workloads, the concept of a discrete “token” begins to break down as a billing unit.</p>

<p>Audio inference presents a specific challenge. A one-minute audio clip, when tokenised for an audio-capable model, produces a token count that varies with the audio’s information density — silence, background noise, speech rate, and speaker count all affect tokenisation. The variance in token counts for semantically equivalent audio makes per-token pricing difficult to plan against. A user who speaks slowly on a quiet line might generate 800 tokens per minute; a rapid speaker in a noisy environment might generate 3,000 tokens per minute for a conversation of equivalent informational value.</p>

<p>Per-minute pricing eliminates this variance. The billing unit is time, which is constant regardless of audio characteristics. From Google’s perspective, it also aligns billing with the most predictable dimension of server-side resource consumption: wall-clock inference time.</p>

<h2 id="gemini-2026-pricing-architecture">Gemini 2026 Pricing Architecture</h2>

<p>As of 2026, the Gemini family pricing for real-time and standard workloads:</p>

<table>
  <thead>
    <tr>
      <th>Model / Mode</th>
      <th>Pricing Unit</th>
      <th>Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gemini 2.0 Flash Live (audio)</td>
      <td>Per minute</td>
      <td>$0.005</td>
    </tr>
    <tr>
      <td>Gemini 2.0 Flash Live (video)</td>
      <td>Per minute</td>
      <td>$0.005</td>
    </tr>
    <tr>
      <td>Gemini 2.0 Flash (text/image input)</td>
      <td>Per 1M tokens</td>
      <td>$0.075</td>
    </tr>
    <tr>
      <td>Gemini 2.0 Flash (text output)</td>
      <td>Per 1M tokens</td>
      <td>$0.30</td>
    </tr>
    <tr>
      <td>Gemini 1.5 Pro (input)</td>
      <td>Per 1M tokens</td>
      <td>$1.25</td>
    </tr>
    <tr>
      <td>Gemini 1.5 Pro (output)</td>
      <td>Per 1M tokens</td>
      <td>$5.00</td>
    </tr>
    <tr>
      <td>Gemini 2.0 Pro (input)</td>
      <td>Per 1M tokens</td>
      <td>$1.25</td>
    </tr>
    <tr>
      <td>Gemini 2.0 Pro (output)</td>
      <td>Per 1M tokens</td>
      <td>$10.00</td>
    </tr>
  </tbody>
</table>

<p>The per-minute rate for Live API is notably aggressive. At $0.005/minute, a 10-minute voice interaction costs $0.05 — five cents. A 1,000-call daily volume of 10-minute interactions costs $50 per day, or approximately $18,000 per year.</p>

<h2 id="when-per-minute-pricing-wins-the-arbitrage-analysis">When Per-Minute Pricing Wins: The Arbitrage Analysis</h2>

<p>The economic comparison between per-minute and per-token pricing depends on the <strong>token density</strong> of your specific workload. The crossover point can be derived from the token cost formula:</p>

\[C_{token} = T_{in} \cdot P_{in} + T_{out} \cdot P_{out}\]

<p>For a per-minute model, cost is simply:</p>

\[C_{minute} = t \cdot P_{min}\]

<p>Where \(t\) is the duration in minutes and \(P_{min}\) is the per-minute rate. The per-minute model is more economical when:</p>

\[t \cdot P_{min} &lt; T_{in} \cdot P_{in} + T_{out} \cdot P_{out}\]

<p>For a Gemini Flash text comparison: at $0.075/1M input and $0.30/1M output, a one-minute interaction that generates 2,500 input tokens and 400 output tokens costs:</p>

\[C_{token} = \frac{2{,}500}{1{,}000{,}000} \times 0.075 + \frac{400}{1{,}000{,}000} \times 0.30 = 0.000188 + 0.00012 = \$0.000308\]

<p>The per-minute rate for the same one-minute interaction: \(\$0.005\). The token rate is <strong>16x cheaper</strong> for a text-only workload of this density.</p>

<p>However, the economics reverse for multimodal workloads. A one-minute audio clip at 150 words per minute, transcribed and processed with a system prompt and conversation history, can generate 20,000–40,000 input tokens depending on the audio model’s tokenisation. At Gemini Pro rates ($1.25/1M input):</p>

\[C_{token} = \frac{30{,}000}{1{,}000{,}000} \times 1.25 = \$0.0375\]

<p>The per-minute rate ($0.005) is now <strong>7.5x cheaper</strong>. This is the workload class for which per-minute pricing was designed, and it represents a genuine economic advantage for real-time voice and video applications.</p>
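<p>Both worked examples reduce to the same comparison between the two cost formulas. This is a minimal sketch using the rates quoted above; the function names are illustrative, and the audio case counts input tokens only, as the calculation in the text does.</p>

```python
def token_cost(tokens_in: float, tokens_out: float,
               p_in: float, p_out: float) -> float:
    """Per-token cost in dollars; prices are per 1M tokens."""
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000


def minute_cost(minutes: float, p_min: float) -> float:
    """Per-minute cost in dollars."""
    return minutes * p_min


def cheaper_billing(tokens_in: float, tokens_out: float, p_in: float,
                    p_out: float, minutes: float, p_min: float) -> str:
    """Return which billing model is cheaper for this workload."""
    if token_cost(tokens_in, tokens_out, p_in, p_out) > minute_cost(minutes, p_min):
        return "per-minute"
    return "per-token"


P_MIN = 0.005

# Text workload at Flash rates: 2,500 input and 400 output tokens per minute.
text_cost = token_cost(2_500, 400, 0.075, 0.30)
# Audio workload at Pro input rate: 30,000 input tokens per minute.
audio_cost = token_cost(30_000, 0, 1.25, 0.0)

print(f"text:  ${text_cost:.6f} per-token vs ${minute_cost(1, P_MIN):.3f} per-minute")
print(f"audio: ${audio_cost:.4f} per-token vs ${minute_cost(1, P_MIN):.3f} per-minute")
```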

<h2 id="the-workload-classification-problem">The Workload Classification Problem</h2>

<p>The practical challenge for engineering leaders is that most real-world AI applications are not purely text or purely audio — they are <strong>hybrid workflows</strong> in which the optimal billing model differs by feature. A customer service platform might use:</p>

<ul>
  <li>Text-based intent classification (strongly favours per-token)</li>
  <li>Voice interaction for the main conversation (favours per-minute)</li>
  <li>Multimodal screenshot analysis for issue diagnosis (depends on image resolution and token encoding)</li>
  <li>Text output for case notes and follow-up emails (strongly favours per-token)</li>
</ul>

<p>An architecture that routes each workload component to the appropriate billing model and provider can reduce total inference costs by 40–60% compared to a uniform model selection. This is <strong>multi-provider inference routing</strong>, and it is becoming a standard capability in mature AI platform architectures.</p>

<p>The routing logic is straightforward in principle but requires empirical calibration:</p>

<ol>
  <li>Profile each workload component for average token counts per unit of time</li>
  <li>Calculate the token-rate cost and per-minute-rate cost for each</li>
  <li>Route to the cheaper option, accounting for latency and quality constraints</li>
  <li>Re-calibrate quarterly as pricing evolves</li>
</ol>
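<p>The first three steps can be sketched as a small routing function. The workload profiles below are hypothetical placeholders for measured values, the rates are the ones quoted earlier in this post, and the <code>per_minute_eligible</code> flag is an assumed way to encode the latency and quality constraints from step 3. Quarterly re-calibration would amount to re-running this with fresh profiles and prices.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    name: str
    tokens_in_per_min: float   # step 1: measured average, per minute
    tokens_out_per_min: float
    p_in: float                # $ per 1M input tokens for the candidate model
    p_out: float               # $ per 1M output tokens
    per_minute_eligible: bool  # False if latency/quality rules the option out


def token_rate_per_min(profile: WorkloadProfile) -> float:
    """Step 2: cost of one minute of this workload under per-token billing."""
    return (
        profile.tokens_in_per_min * profile.p_in
        + profile.tokens_out_per_min * profile.p_out
    ) / 1_000_000


def route(profile: WorkloadProfile, p_min: float) -> str:
    """Step 3: pick the cheaper billing model, respecting eligibility."""
    if profile.per_minute_eligible and token_rate_per_min(profile) > p_min:
        return "per-minute"
    return "per-token"


# Hypothetical profiles; replace with measurements from your own traffic.
PROFILES = [
    WorkloadProfile("intent classification", 2_500, 400, 0.075, 0.30, False),
    WorkloadProfile("voice conversation", 30_000, 0, 1.25, 10.00, True),
    WorkloadProfile("case notes", 1_200, 600, 0.075, 0.30, False),
]

if __name__ == "__main__":
    for profile in PROFILES:
        print(f"{profile.name}: {route(profile, p_min=0.005)}")
```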

<h2 id="competitive-implications-of-googles-move">Competitive Implications of Google’s Move</h2>

<p>Google’s per-minute pricing is not just a billing convenience — it is a market positioning decision. By making Gemini Live API economical for high-volume real-time applications, Google is targeting the workload category where it has the strongest infrastructure advantage: streaming inference at scale, backed by TPU v5 hardware optimised for throughput rather than latency.</p>

<p>The competitive pressure on Anthropic and OpenAI is real. Neither currently offers a per-minute pricing tier for real-time audio. Both are priced on token-based models that make extended voice interactions disproportionately expensive relative to Google’s offering. If per-minute pricing proves to be the preferred billing model for the voice assistant category — which represents one of the highest-volume AI deployment patterns in consumer and enterprise markets — the pricing structure advantage could translate into significant market share over a 12–18 month horizon.</p>

<p>For operators, this creates a short-term opportunity to capture genuine cost savings by adopting Gemini Live for qualifying workloads, while maintaining existing Anthropic or OpenAI integrations for text-heavy tasks where the token economics remain superior.</p>

<h2 id="governance-implications-time-based-spend-monitoring">Governance Implications: Time-Based Spend Monitoring</h2>

<p>Per-minute billing requires different monitoring primitives than per-token billing. The relevant metrics shift from:</p>

<ul>
  <li>Tokens per call → Minutes per session</li>
  <li>Token budget per user → Session duration budget per user</li>
  <li>Context window utilisation → Session length distribution</li>
</ul>

<p>An organisation accustomed to token-level spend instrumentation will need to extend its observability stack to capture session duration metrics for Live API workloads. The risk profile also changes: rather than a single call that unexpectedly balloons to 200K tokens, the equivalent risk in per-minute billing is a session that runs for an unexpected duration — a user who leaves an audio session open, an agent that fails to terminate a real-time task, or a voice interface that enters a conversational loop.</p>

<p>Duration limits, idle session timeouts, and session cost caps are the per-minute equivalents of context window limits and token budgets. They require explicit implementation. The meter runs whether or not the session is producing value.</p>
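<p>As a rough illustration of those three controls, the sketch below checks a live session against a duration limit, an idle timeout, and a cost cap. The class name <code>SessionGuard</code>, its fields, and the example thresholds are hypothetical; only the $0.005 per-minute rate comes from this article. It is a minimal sketch, not a description of any Gemini API.</p>

```python
from dataclasses import dataclass
from typing import Optional

RATE_PER_MINUTE = 0.005  # USD per minute, the rate quoted in this article


@dataclass
class SessionGuard:
    """Hypothetical guard that decides when a live session must be closed."""
    max_minutes: float      # hard duration limit
    idle_timeout_s: float   # close after this many seconds without activity
    cost_cap_usd: float     # per-session spend ceiling

    def cost(self, elapsed_s: float) -> float:
        # Time-based billing: cost depends only on elapsed time.
        return (elapsed_s / 60.0) * RATE_PER_MINUTE

    def should_terminate(self, elapsed_s: float, idle_s: float) -> Optional[str]:
        """Return the reason to close the session, or None to keep it open."""
        if elapsed_s >= self.max_minutes * 60:
            return "duration_limit"
        if idle_s >= self.idle_timeout_s:
            return "idle_timeout"
        if self.cost(elapsed_s) >= self.cost_cap_usd:
            return "cost_cap"
        return None


# Example thresholds are assumptions chosen for illustration only.
guard = SessionGuard(max_minutes=30, idle_timeout_s=120, cost_cap_usd=0.10)
```

<p>A caller would evaluate <code>guard.should_terminate(elapsed_s, idle_s)</code> on a timer and close the session whenever it returns a reason, so that an abandoned session stops accruing minutes.</p>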

<hr />

<p><strong>Next in the series:</strong> <a href="/prompt-caching-kv/">KV Cache Optimization: Why Server-Side Cache Is the New S3 of AI Infrastructure</a> — the technical mechanics of prompt caching, why it is the single highest-ROI optimisation available to most teams today, and how to architect your prompts to maximise cache hit rates.</p>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="Gemini" /><category term="Google" /><category term="per-minute pricing" /><category term="real-time AI" /><category term="multimodal" /><category term="ROI" /><summary type="html"><![CDATA[The token has been the unit of account for AI inference since the first public OpenAI APIs launched in 2020. Every pricing page, every cost model, every engineering estimate in the industry has been denominated in tokens per million. In 2026, Google disrupted that convention with the Gemini Live API, priced not at the token level but at $0.005 per minute of audio interaction. This is not a minor pricing variant — it is a structural challenge to the assumptions that underpin every real-time AI application budget. 
Understanding when per-minute pricing is economically superior to per-token pricing, and when it is not, is now a required competency for any engineering leader deploying AI at scale.]]></summary></entry><entry><title type="html">The Infinite Spend Bug: Recursive Agent Loops and the Metered Future of Agentic AI</title><link href="https://goldmanmalka.com/https://goldmanmalka.com/agentic-loops/" rel="alternate" type="text/html" title="The Infinite Spend Bug: Recursive Agent Loops and the Metered Future of Agentic AI" /><published>2026-05-11T07:00:00+00:00</published><updated>2026-05-11T07:00:00+00:00</updated><id>https://goldmanmalka.com/https://goldmanmalka.com/agentic-loops</id><content type="html" xml:base="https://goldmanmalka.com/https://goldmanmalka.com/agentic-loops/"><![CDATA[<p>A software bug that causes infinite recursion terminates with a stack overflow. A <strong>token bug</strong> — an agentic AI loop that recurses without a termination condition — terminates with a billing invoice. In 2026, as autonomous agents displaced simple chat completions as the primary AI interaction pattern, organisations discovered that the economics of agentic systems are fundamentally different from those of single-shot inference. Anthropic’s decision to move Claude agents onto metered billing across all subscription tiers was not a product update. It was a signal that the industry has reached an inflection point where agent economics require the same governance discipline as cloud infrastructure.</p>

<!--more-->

<h2 id="what-makes-an-agent-loop-different-from-a-completion">What Makes an Agent Loop Different from a Completion</h2>

<p>A standard LLM completion is a bounded transaction: a prompt goes in, a response comes out, and the API call closes. Cost is bounded at the time of the request: the input token count is known, and the output is capped by the maximum output length set on the call.</p>

<p>An <strong>agentic loop</strong> is structurally different. It is a control flow in which the model’s output determines the next action, which in turn generates new input, which feeds another model call. The loop terminates when the model decides it has completed its objective — or when an external circuit breaker intervenes. If neither condition is met cleanly, the loop continues.</p>

<p>The cost of a single agent task is therefore not the cost of one API call. It is the sum of all API calls across the entire task execution:</p>

\[C_{agent} = \sum_{k=1}^{K} \left( T_{in}^{(k)} \cdot P_{in} + T_{out}^{(k)} \cdot P_{out} \right)\]

<p>Where \(K\) is the number of steps the agent takes to complete (or fail to complete) the task. In a well-designed agent, \(K\) is bounded. In a poorly designed one, \(K\) is determined by the model’s own assessment of task completion — which can be confused, misdirected, or pathologically optimistic.</p>
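<p>The summation above can be sketched in a few lines. The prices and the token counts per step are hypothetical values chosen for illustration; the assumption that each step adds to the context carried by the next step is a simplification of how many agent loops behave.</p>

```python
def agent_cost(steps, price_in, price_out):
    """Sum per-step cost; steps is a list of (input_tokens, output_tokens)."""
    return sum(t_in * price_in + t_out * price_out for t_in, t_out in steps)


# Hypothetical prices in USD per token (3 and 15 USD per million tokens).
P_IN = 3 / 1_000_000
P_OUT = 15 / 1_000_000

# Hypothetical ten-step run: the input grows by 1,500 tokens per step
# because earlier tool results are fed back into the context.
steps = [(2_000 + 1_500 * k, 500) for k in range(10)]

total = agent_cost(steps, P_IN, P_OUT)         # cost of the whole run
first_call = agent_cost(steps[:1], P_IN, P_OUT)  # cost of step 1 alone
```

<p>Under these assumptions the ten-step run costs about 25 times the first call rather than 10 times, because the input side of each step keeps growing.</p>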

<h2 id="the-anatomy-of-a-runaway-loop">The Anatomy of a Runaway Loop</h2>

<p>Consider an agent tasked with “audit the security posture of this repository and generate a remediation plan.” The intended execution path might be:</p>

<ol>
  <li>Read repository structure</li>
  <li>Identify configuration files</li>
  <li>Check each file against a security policy</li>
  <li>Generate findings</li>
  <li>Draft remediation plan</li>
  <li>Return result</li>
</ol>

<p>In practice, an under-specified agent with access to search, read, and web-browsing tools may:</p>

<ol>
  <li>Read repository structure (1 API call)</li>
  <li>Identify 47 configuration files (1 API call)</li>
  <li>Check file 1 — finds a reference to an external library, decides to research the library’s CVE history (3 API calls)</li>
  <li>CVE research yields references to related libraries — agent expands scope (8 more API calls)</li>
  <li>Each new library reference expands the research scope further</li>
  <li>After 200+ API calls covering three hours of wall-clock time, the agent produces a report that references 340 libraries, most of which are irrelevant</li>
</ol>

<p>This is not a theoretical failure mode. It is a recurring pattern in agentic frameworks that let models self-direct their tool use without explicit step budgets. The model is behaving rationally from its own perspective: gathering more information to produce a more comprehensive output. The operator never specified a limit, so the model imposed none.</p>

<h2 id="anthropics-response-metered-agent-billing">Anthropic’s Response: Metered Agent Billing</h2>

<p>Anthropic’s 2026 shift to <strong>metered agent billing</strong> across its subscription tiers reflects a recognition that flat-rate pricing for agentic workloads is structurally unsustainable — for both the operator running the agent and for Anthropic managing server capacity.</p>

<p>Under the new model, agent runs that use Claude’s built-in tools (computer use, web search, file operations) are billed against a token meter that tracks both input and output across the entire agent session. Subscription tiers receive a monthly token allocation; overages are billed at standard API rates. The architecture creates a natural pressure toward efficient agent design: organisations that burn their allocation on runaway loops face real financial consequences, not just degraded performance.</p>

<p>The implications for teams building on the Claude API rather than the consumer product are equally significant. The metered model validates a set of engineering practices that had previously been regarded as optional:</p>

<table>
  <thead>
    <tr>
      <th>Practice</th>
      <th>Pre-Metering Status</th>
      <th>Post-Metering Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Step count limits per agent run</td>
      <td>Optional best practice</td>
      <td>Financial necessity</td>
    </tr>
    <tr>
      <td>Token budget per task type</td>
      <td>Rarely implemented</td>
      <td>Standard requirement</td>
    </tr>
    <tr>
      <td>Agent run cost attribution</td>
      <td>Difficult to instrument</td>
      <td>Critical for billing accuracy</td>
    </tr>
    <tr>
      <td>Circuit breakers on tool calls</td>
      <td>Advanced implementations only</td>
      <td>Baseline engineering requirement</td>
    </tr>
    <tr>
      <td>Task scope specification in prompts</td>
      <td>Improved output quality</td>
      <td>Also cost control</td>
    </tr>
  </tbody>
</table>

<h2 id="the-three-failure-modes">The Three Failure Modes</h2>

<p>Agentic cost overruns concentrate around three patterns:</p>

<p><strong>1. Scope creep without bounds.</strong> The agent is given access to tools that allow it to expand its own task definition. Web search is the most common vector — a task that starts as “summarise this document” becomes “research all claims in this document” becomes “verify every cited source” becomes a multi-hour research engagement. Mitigation: define tool access narrowly, and include explicit scope boundaries in system prompts (“do not follow external links; work only from the provided document”).</p>
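<p>One way to enforce narrow tool access outside the prompt is an allowlist checked before every tool call. The sketch below assumes a hypothetical <code>ToolPolicy</code> class and hypothetical tool names; it is not the API of any particular agent framework.</p>

```python
from dataclasses import dataclass, field

# Scope boundary quoted from the mitigation above, for use in a system prompt.
SCOPE_INSTRUCTION = (
    "Do not follow external links; work only from the provided document."
)


@dataclass
class ToolPolicy:
    """Hypothetical allowlist consulted before each tool call."""
    allowed: frozenset
    denied_calls: list = field(default_factory=list)

    def authorize(self, tool_name: str) -> bool:
        if tool_name in self.allowed:
            return True
        # Record the attempt so scope creep is visible in logs.
        self.denied_calls.append(tool_name)
        return False


# A summarisation task gets read access only; tool names are illustrative.
summarise_policy = ToolPolicy(allowed=frozenset({"read_document"}))
```

<p>Pairing the prompt instruction with a hard check means that a model which ignores the instruction still cannot reach the web search tool.</p>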

<p><strong>2. Retry loops on tool failure.</strong> When a tool call fails — a web page returns a 404, an API is rate-limited, a file cannot be parsed — a poorly calibrated agent will retry. Without a maximum retry count, transient failures become infinite loops. Mitigation: implement explicit retry budgets (maximum 3 retries per tool call) and instruct the model to report failures rather than loop on them.</p>
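<p>A retry budget can be as small as a bounded loop around the tool call. In this sketch, <code>ToolFailure</code> and <code>call_with_retry_budget</code> are hypothetical names, and backoff between attempts is omitted for brevity.</p>

```python
class ToolFailure(Exception):
    """Hypothetical error type raised by a tool when a call fails."""


def call_with_retry_budget(tool, *args, max_retries=3):
    """Call tool(*args) once, then retry at most max_retries times.

    When the budget is spent, return a failure report for the model
    instead of looping on the same error.
    """
    last_error = None
    attempts = 0
    for attempts in range(1, max_retries + 2):  # 1 initial call + retries
        try:
            return {"ok": True, "result": tool(*args), "attempts": attempts}
        except ToolFailure as exc:
            last_error = str(exc)
    return {"ok": False, "error": last_error, "attempts": attempts}
```

<p>The failure report is what gets passed back to the model, which matches the instruction to report failures rather than loop on them.</p>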

<p><strong>3. Verification spirals.</strong> Some models, when instructed to be thorough, enter a pattern of self-verification — generating a result, evaluating it, finding it inadequate, regenerating, re-evaluating. Each iteration is a full generation cycle. Mitigation: separate generation from evaluation into distinct agent steps with independent token budgets, and cap the number of evaluation passes.</p>
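<p>Capping evaluation passes might look like the following sketch. The function name and the shape of the <code>generate</code> and <code>evaluate</code> callables are assumptions made for illustration; in a real system each would wrap a separate model call with its own token budget.</p>

```python
def generate_with_capped_review(generate, evaluate, max_passes=2):
    """Run generation and evaluation as separate steps with a hard cap.

    generate(feedback) returns a draft; evaluate(draft) returns
    (accepted, feedback). At most max_passes evaluations are run.
    """
    draft = generate(None)
    for passes in range(1, max_passes + 1):
        accepted, feedback = evaluate(draft)
        if accepted:
            return {"draft": draft, "passes": passes, "accepted": True}
        draft = generate(feedback)
    # Budget spent: return the latest draft, flagged as not accepted.
    return {"draft": draft, "passes": max_passes, "accepted": False}
```

<p>With <code>max_passes=2</code> the worst case is three generations and two evaluations, which makes the maximum cost of the verification stage known in advance.</p>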

<h2 id="designing-cost-bounded-agents">Designing Cost-Bounded Agents</h2>

<p>The engineering pattern that prevents runaway loops is the <strong>explicit budget constraint</strong>, embedded at three levels:</p>

<p><strong>Prompt-level constraints.</strong> The system prompt specifies maximum steps, maximum tool calls, and acceptable scope boundaries. Example: “Complete this task in no more than 10 steps. If you cannot complete it within 10 steps, return a partial result with an explanation of what remains.”</p>

<p><strong>Framework-level guardrails.</strong> Agent orchestration frameworks (LangGraph, AutoGen, CrewAI) expose iteration limits as configuration, for example <code>recursion_limit</code> in LangGraph and <code>max_iter</code> in CrewAI. These should always be set explicitly: where a framework ships a default, it is a generic value that was not calibrated against your per-task budget.</p>
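<p>The idea behind an iteration limit can be shown without any framework. The loop below is a framework-agnostic sketch; <code>run_agent</code> and the <code>step_fn</code> contract are hypothetical and do not reproduce the API of LangGraph, AutoGen, or CrewAI.</p>

```python
def run_agent(step_fn, max_steps=10):
    """Drive step_fn until it reports completion or the step budget is spent.

    step_fn(state) returns (new_state, done). The loop never runs more
    than max_steps iterations, whatever the model believes about progress.
    """
    state = {"history": []}
    for step in range(1, max_steps + 1):
        state, done = step_fn(state)
        if done:
            return {"status": "completed", "steps": step, "state": state}
    # Budget exhausted: hand back the partial state instead of continuing.
    return {"status": "step_budget_exhausted", "steps": max_steps, "state": state}
```

<p>Returning the partial state on exhaustion mirrors the prompt-level instruction to return a partial result when the step limit is reached.</p>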

<p><strong>Infrastructure-level circuit breakers.</strong> API gateway or proxy layers can enforce hard token limits per agent session, killing runs that exceed the budget regardless of model state. This is the safety net when prompt-level and framework-level constraints fail.</p>
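<p>A proxy-side circuit breaker can be reduced to a running token counter per session. The sketch below assumes hypothetical names (<code>SessionTokenBreaker</code>, <code>BudgetExceeded</code>) and assumes the proxy can read input and output token counts from each response.</p>

```python
class BudgetExceeded(Exception):
    """Hypothetical signal that the proxy should kill the agent session."""


class SessionTokenBreaker:
    """Tracks tokens for one agent session and trips at a hard limit."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Called after every model response passing through the proxy.
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"session used {self.used} tokens, limit {self.max_tokens}"
            )

    def remaining(self) -> int:
        return max(self.max_tokens - self.used, 0)
```

<p>Because the check runs outside the agent process, it still fires when the prompt is ignored or the framework limit is misconfigured.</p>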

<p>The cost model for a well-governed agent run is:</p>

\[C_{bounded} = \min\left( C_{agent},\ K_{max} \cdot \bar{C}_{step} \right)\]

<p>Where \(K_{max}\) is the maximum permitted step count and \(\bar{C}_{step}\) is the average cost per agent step — a number that can be calibrated empirically during development and used to set \(K_{max}\) against a per-task budget constraint.</p>
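<p>Calibrating \(K_{max}\) from measured step costs is simple arithmetic. In the sketch below the observed step costs and the per-task budget are hypothetical numbers, and the function names are illustrative.</p>

```python
import math
from statistics import mean


def calibrate_k_max(observed_step_costs, task_budget_usd):
    """Derive the step limit from the average measured cost per step."""
    avg_step_cost = mean(observed_step_costs)
    k_max = math.floor(task_budget_usd / avg_step_cost)
    return k_max, avg_step_cost


def bounded_cost(agent_cost, k_max, avg_step_cost):
    """C_bounded = min(C_agent, K_max * average step cost)."""
    return min(agent_cost, k_max * avg_step_cost)


# Hypothetical per-step costs in USD from development runs,
# and a hypothetical budget of 0.50 USD per task.
k_max, avg = calibrate_k_max([0.02, 0.03, 0.04, 0.03], task_budget_usd=0.50)
```

<p>With these inputs the average step costs about $0.03, so the step limit comes out at 16, and a run that would otherwise cost $2.00 is capped at roughly $0.48.</p>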

<h2 id="what-the-metered-shift-means-for-architecture">What the Metered Shift Means for Architecture</h2>

<p>Anthropic’s move to metered agent billing is not an isolated vendor decision. It reflects a broader industry direction. As agents become the primary interface through which organisations interact with AI — replacing both chat UIs and batch pipelines — the cost per unit of value delivered will increasingly be measured in <strong>agent steps</strong> rather than tokens.</p>

<p>This changes the economic calculus for every AI architecture decision. The question is no longer only “which model produces the best output?” It becomes “which model produces sufficient output at the lowest step count?” A smaller, faster model that completes a task in four steps can be economically superior to a more capable model that reaches the same outcome in twelve, even if the per-token price of the smaller model is higher, as long as the price premium is smaller than the saving in total tokens across the run.</p>
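<p>The four-step versus twelve-step comparison can be checked with a short calculation. The sketch assumes, for simplicity, that both models consume the same number of tokens per step; the token count and prices are hypothetical.</p>

```python
def task_cost(steps, tokens_per_step, price_per_token):
    """Cost of a run, assuming a constant token count per step."""
    return steps * tokens_per_step * price_per_token


def break_even_price_ratio(steps_a, steps_b):
    """How many times pricier per token model A can be than model B
    before A's run costs the same, given equal tokens per step."""
    return steps_b / steps_a


# Hypothetical figures: 3,000 tokens per step; model A costs twice
# as much per token as model B but needs 4 steps instead of 12.
cost_a = task_cost(steps=4, tokens_per_step=3_000, price_per_token=2e-6)
cost_b = task_cost(steps=12, tokens_per_step=3_000, price_per_token=1e-6)
ratio = break_even_price_ratio(4, 12)
```

<p>Here the four-step run costs $0.024 against $0.036 for the twelve-step run, and the advantage holds until the per-token price of the four-step model reaches three times that of the other.</p>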

<p>The organisations that tune their agents for step efficiency now will have a durable cost advantage as the agentic paradigm matures. Those that optimise only for output quality, without constraining step count, will find that the meter runs whether or not the agent produces useful work.</p>

<hr />

<p><strong>Next in the series:</strong> <a href="/per-minute-pricing/">Beyond the Token: Google’s Per-Minute Pricing and What It Means for the Economics of Real-Time AI</a> — how Gemini Live API’s $0.005/minute rate is disrupting the token-per-call pricing model, and when it makes economic sense to pay for time rather than tokens.</p>]]></content><author><name>Eran Goldman-Malka</name></author><category term="AI" /><category term="Economics" /><category term="token economy" /><category term="cost management" /><category term="agentic AI" /><category term="Claude" /><category term="Anthropic" /><category term="metered billing" /><category term="ROI" /><category term="agent loops" /><summary type="html"><![CDATA[A software bug that causes infinite recursion terminates with a stack overflow. A token bug — an agentic AI loop that recurses without a termination condition — terminates with a billing invoice. In 2026, as autonomous agents displaced simple chat completions as the primary AI interaction pattern, organisations discovered that the economics of agentic systems are fundamentally different from those of single-shot inference. Anthropic’s decision to move Claude agents onto metered billing across all subscription tiers was not a product update. It was a signal that the industry has reached an inflection point where agent economics require the same governance discipline as cloud infrastructure.]]></summary></entry></feed>