Where Is the Compliance Boundary? Open Access, Circumvention, Scraping, and Terms of Service

Eran Goldman-Malka · June 5, 2026

The compliance line in AI-assisted scholarly retrieval is not drawn at “open versus closed.” It runs between locating publisher-authorized open-access copies and any practice that bypasses technical or contractual access controls, exceeds license scope, or violates site or API terms—even when an AI system could technically retrieve the bytes. Here we map where lawful OA ends and circumvention, scraping, and policy violations begin.

A Four-Ring Boundary Model

Compliance teams benefit from a layered model rather than a binary “legal/illegal” flag. Think of four concentric rings:

Ring 1 — Discovery services. Unpaywall API, library link resolvers, discovery layers (EBSCO, Primo, WorldCat). Lowest friction when used as documented. Contractual obligations here are primarily API terms and rate limits (REST API).

Ring 2 — Lawful access endpoints. Publisher OA pages, institutional repositories, preprint servers, subject archives. The bytes live here. Access may be free to humans; automated access may still be constrained.

Ring 3 — Permitted use. License terms (CC BY, CC BY-NC, publisher custom), embargo rules on green OA, citation requirements. This ring governs what you may do after lawful access.

Ring 4 — Prohibited or high-risk conduct. Toll-access content without authorization, credential sharing, DRM or technical protection measure (TPM) circumvention, ignoring rightholder opt-outs, training commercial models on NC-licensed corpora without permission.

An AI pipeline can pass Ring 1 and Ring 2 while failing Ring 3 or Ring 4. That is the compliance boundary this post addresses.
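One way to make the ring model auditable is to record a finding per ring and report the first ring that fails. The sketch below is illustrative only: `Ring`, `RingFinding`, and `first_failure` are hypothetical names, and the sample findings are invented for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Ring(IntEnum):
    DISCOVERY = 1      # Ring 1: discovery services
    ACCESS = 2         # Ring 2: lawful access endpoints
    PERMITTED_USE = 3  # Ring 3: license, embargo, citation terms
    PROHIBITED = 4     # Ring 4: prohibited or high-risk conduct


@dataclass(frozen=True)
class RingFinding:
    ring: Ring
    passed: bool  # for Ring 4, True means no prohibited conduct was found
    note: str


def first_failure(findings):
    """Return the lowest-numbered failing finding, or None if all passed."""
    failed = [f for f in findings if not f.passed]
    return min(failed, key=lambda f: f.ring) if failed else None


# Invented sample: a pipeline that passes Rings 1 and 2 but fails Ring 3.
findings = [
    RingFinding(Ring.DISCOVERY, True, "API used within documented limits"),
    RingFinding(Ring.ACCESS, True, "copy hosted on an institutional repository"),
    RingFinding(Ring.PERMITTED_USE, False, "CC BY-NC copy embedded in a commercial index"),
    RingFinding(Ring.PROHIBITED, True, "no credentials or TPM circumvention involved"),
]

failure = first_failure(findings)
print(failure.ring.name if failure else "all rings passed")
```

Keeping a note per ring, rather than a single pass/fail flag, gives reviewers the reason for each outcome.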

What Is Not Circumvention (When Done Correctly)

Fact: The Unpaywall browser extension invites users to “skip the paywall on millions of peer-reviewed journal articles” by clicking when a legal free copy exists (extension). The marketing language reflects user experience on toll-access landing pages—not unauthorized access to content publishers have withheld.

Fact: Library link-resolver integrations use Unpaywall data to offer OA copies when the institution lacks a subscription (link resolver integrations). When no OA location exists, redirecting to the toll-access publisher page is described as correct behavior.

Interpretation: These patterns are structurally different from services that obtain content through stolen credentials, exploit kits, or infringing mirrors. Unpaywall is not comparable to those tools—and compliance discussions should not collapse them into one category.

What Is Circumvention or High-Risk Analog

The following patterns sit outside lawful OA discovery even when Unpaywall or similar tools appear somewhere in the toolchain:

  • Credential laundering: Using Unpaywall to identify an article, then fetching the toll-access version via shared institutional proxy cookies or pirated credentials.
  • Status ignoring: Disregarding closed or missing OA signals and brute-forcing alternate URLs, leaked copies, or third-party infringing hosts.
  • TPM breaking: Circumventing digital rights management or access controls on publisher platforms (jurisdiction-specific: U.S. DMCA §1201 and EU copyright rules on technological protection measures may apply; this is not legal advice).

Sci-Hub and similar infringing services are not OA discovery—they are unauthorized distribution. Any AI pipeline that routes through them fails compliance regardless of how the DOI was resolved.
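The first two patterns above can be guarded against before any fetch happens. The sketch below assumes the pipeline holds an Unpaywall-style record for the DOI; the field names `is_oa`, `oa_locations`, `url`, `url_for_pdf`, and `url_for_landing_page` follow the Unpaywall data format as I understand it and should be checked against the current schema. `may_fetch` is a hypothetical helper, not part of any library.

```python
def may_fetch(record, url, has_credentials):
    """Allow a fetch only for a reported OA location in an unauthenticated session."""
    if has_credentials:
        # Guards against credential laundering: no cookies, tokens, or proxy logins.
        return False
    if not record.get("is_oa"):
        # Closed or missing OA signal: do not hunt for alternate copies.
        return False
    reported = set()
    for location in record.get("oa_locations") or []:
        for key in ("url", "url_for_pdf", "url_for_landing_page"):
            if location.get(key):
                reported.add(location[key])
    # Only URLs the discovery service actually reported are eligible.
    return url in reported


# Invented sample record for illustration.
record = {
    "is_oa": True,
    "oa_locations": [
        {
            "url": "https://repo.example.org/a.pdf",
            "url_for_pdf": "https://repo.example.org/a.pdf",
            "url_for_landing_page": "https://repo.example.org/a",
        }
    ],
}
print(may_fetch(record, "https://repo.example.org/a.pdf", has_credentials=False))
```

A guard like this does not detect TPM circumvention; it only keeps the agent from wandering beyond the locations that discovery reported.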

Scraping Versus Reading

Even when Ring 2 access is free for human readers, Ring 3 and site policies may restrict automation.

Risk analysis: Headless browsers fetching thousands of OA PDFs overnight can trigger bot management, breach-of-contract claims under terms of service, and—in the U.S.—debates about computer fraud and abuse statutes that courts have applied inconsistently (jurisdiction-specific; not legal advice). The Congressional Research Service provides a neutral overview of generative AI and copyright intersections (CRS LSB10922), but site-access questions often turn on contract and computer-access law rather than copyright alone.

The operative distinction for governance:

| Pattern | Typical risk profile |
|---|---|
| User-initiated single fetch for session summarization | Lower |
| Unattended corpus harvest into persistent storage | Higher |
| Commercial RAG index built from publisher domains at scale | Highest without explicit license |

Fact: Unpaywall’s API terms limit daily call volume (REST API). That is a contractual boundary distinct from—but related to—publisher scraping norms.
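A daily cap is straightforward to enforce client-side. The sketch below is a minimal counter, assuming the cap is supplied from the current API terms rather than hard-coded; `DailyCallBudget` is a hypothetical name, and resetting at UTC midnight is an assumption, not something the terms are known to specify.

```python
import datetime as dt


class DailyCallBudget:
    """Counts API calls per UTC day and refuses calls past a configured cap."""

    def __init__(self, max_calls_per_day):
        self.max_calls = max_calls_per_day  # take this value from the API terms
        self._day = None
        self._used = 0

    def try_acquire(self, now=None):
        now = now or dt.datetime.now(dt.timezone.utc)
        today = now.date()
        if today != self._day:  # new day: reset the counter
            self._day = today
            self._used = 0
        if self._used >= self.max_calls:
            return False
        self._used += 1
        return True


budget = DailyCallBudget(max_calls_per_day=2)  # tiny cap for demonstration only
noon = dt.datetime(2026, 6, 5, 12, tzinfo=dt.timezone.utc)
print([budget.try_acquire(noon) for _ in range(3)])
```

Calling `try_acquire()` before every API request makes the contractual limit a hard stop instead of a convention the agent can drift past during retries.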

Browser Extension Versus AI Agent

The Unpaywall extension operates in a human context: one page, one article, one click. An AI agent operates in a machine context: loops, retries, parallel fetches, unpredictable volume, and downstream storage.

Compliance controls for agents should include:

  • Source allowlists (repositories, known OA hosts—not open web wildcards)
  • Rate limits per domain and per workflow
  • Robots.txt and ToS review before automated fetch is enabled
  • No credential injection into headless sessions
  • Separation of session reads from persistent retention (displaying text for answer synthesis vs. embedding it in a vector DB)
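Several of these controls can be combined into one pre-fetch gate. The sketch below is a rough illustration using Python's standard `urllib.robotparser`; the host names, user agent, and five-second interval are hypothetical placeholders, and the allowlist is assumed to contain only hosts whose ToS a human has already reviewed, since ToS review itself cannot be automated this way.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ALLOWED_HOSTS = {"repository.example.edu", "preprints.example.org"}  # hypothetical
MIN_INTERVAL_S = 5.0                    # hypothetical per-domain spacing
USER_AGENT = "example-research-agent"   # hypothetical


def parse_robots(text):
    """Parse robots.txt content that was fetched and stored earlier."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class FetchGate:
    def __init__(self, robots_by_host, clock=time.monotonic):
        self._robots = robots_by_host  # host -> RobotFileParser
        self._clock = clock
        self._last_fetch = {}          # host -> time of last allowed fetch

    def check(self, url, has_credentials=False):
        host = urlsplit(url).hostname or ""
        if has_credentials:
            return "DENY: credentials attached"
        if host not in ALLOWED_HOSTS:
            return "DENY: host not allowlisted"
        robots = self._robots.get(host)
        if robots is None or not robots.can_fetch(USER_AGENT, url):
            return "DENY: robots.txt missing or disallows"
        now = self._clock()
        last = self._last_fetch.get(host)
        if last is not None and now - last < MIN_INTERVAL_S:
            return "WAIT: rate limit"
        self._last_fetch[host] = now
        return "ALLOW"


gate = FetchGate({"repository.example.edu": parse_robots("User-agent: *\nDisallow: /private/\n")})
print(gate.check("https://repository.example.edu/handle/1.pdf"))
```

The gate fails closed: a host with no stored robots.txt is denied, not assumed to be open.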

Anthropic’s Usage Policy updates clarify platform-side prohibitions on malicious computer and network compromise. Your retrieval agent’s behavior must comply with both vendor policy and publisher policy—the stricter effective control wins.

Edge Cases That Break Naive Automation

Bronze OA. Free on the publisher site but license null in Unpaywall metadata (oa_status definitions). Access may be lawful; reuse for AI products is ambiguous. Default-deny for automated commercial reuse unless counsel approves.

Version mismatch. Green repository copies may be preprints or author-accepted manuscripts, not the version of record. Citation norms, commercial use, and publisher policies may differ (openaccess.nl copyright guide).

Paratext and non-article DOIs. Unpaywall indexes Crossref DOIs broadly; not every DOI is a journal article (support FAQ). Garbage DOIs in, garbage compliance out.

Extension-only bronze detection. The extension may surface bronze when it finds a PDF on-page without going through the API—behavior documented in Unpaywall support materials. Agents relying only on API metadata may classify the same article differently.
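Three of these edge cases can be flagged mechanically from API metadata before a document reaches the fetch queue. The sketch below assumes Unpaywall-style fields (`genre`, `oa_status`, `is_oa`, and a `best_oa_location` object with `license`, `host_type`, and `version`); verify the names and values against the current schema. `edge_case_flags` and its flag strings are hypothetical. The extension-only bronze case cannot be caught this way, because it never appears in the API response.

```python
def edge_case_flags(record):
    """Return review flags for one Unpaywall-style record."""
    flags = []

    # Paratext and non-article DOIs: anything not typed as a journal article.
    if record.get("genre") != "journal-article":
        flags.append("non_article_doi")

    location = record.get("best_oa_location") or {}

    # Bronze OA, or any OA copy without license metadata.
    if record.get("oa_status") == "bronze" or (
        record.get("is_oa") and not location.get("license")
    ):
        flags.append("no_open_license")

    # Version mismatch: repository copy that is not the published version.
    if (
        location.get("host_type") == "repository"
        and location.get("version") != "publishedVersion"
    ):
        flags.append("not_version_of_record")

    return flags


# Invented sample: a green copy of an accepted manuscript.
green = {
    "genre": "journal-article",
    "oa_status": "green",
    "is_oa": True,
    "best_oa_location": {
        "license": "cc-by",
        "host_type": "repository",
        "version": "acceptedVersion",
    },
}
print(edge_case_flags(green))
```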

Real-World Example

A compliance team audits a “research agent” that queries Unpaywall, then fetches ten thousand PDFs overnight from publisher domains into cloud storage for RAG indexing. Ring 1 (API use) may be lawful if rate limits were respected. Ring 2 (many copies are OA) may be lawful for access. Ring 3 (licenses vary; bronze and NC articles mixed in) is unreviewed. Ring 4 (high-risk conduct: bulk harvesting against site ToS and scraping norms) is likely triggered regardless of OA status. Discovery was not the failure point; automation scale and reuse without license triage were.

Technical Compliance: Policy Engine Design

Build—or buy—a policy engine that evaluates each candidate document before fetch and again before retain:

Inputs:  doi, oa_status, license, host_type, robots_txt, intended_action
Actions: display | summarize_ephemeral | store | embed | train
Decision: ALLOW | REVIEW | DENY

Recommended defaults:

  • DENY automated retain/embed when license is null
  • REVIEW all green repository copies for version and embargo
  • DENY fetch from toll-access URLs unless separate authorization exists
  • ALLOW ephemeral summarize of CC BY with attribution logging
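The inputs, actions, and defaults above can be sketched as a small rule function. This is one possible reading, not a reference implementation: `Candidate` and `decide` are hypothetical names, the `authorized` flag stands in for "separate authorization", and three choices are my own assumptions rather than stated defaults: treating `train` like `store` and `embed`, denying when robots.txt disallows, and falling back to REVIEW when no rule matches.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "ALLOW"
    REVIEW = "REVIEW"
    DENY = "DENY"


RETAIN_ACTIONS = {"store", "embed", "train"}  # "train" included by assumption


@dataclass(frozen=True)
class Candidate:
    doi: str
    oa_status: str            # e.g. "gold", "green", "bronze", "closed"
    license: Optional[str]    # e.g. "cc-by", "cc-by-nc", or None
    host_type: Optional[str]  # "publisher" or "repository"
    robots_allows: bool       # outcome of an earlier robots.txt check
    intended_action: str      # display | summarize_ephemeral | store | embed | train
    authorized: bool = False  # separate authorization for toll-access content


def decide(c):
    if c.oa_status == "closed" and not c.authorized:
        return Decision.DENY      # toll-access without authorization
    if not c.robots_allows:
        return Decision.DENY      # assumption: robots.txt disallow is a hard stop
    if c.intended_action in RETAIN_ACTIONS and c.license is None:
        return Decision.DENY      # no license metadata: no automated retention
    if c.oa_status == "green" and c.host_type == "repository":
        return Decision.REVIEW    # check version and embargo
    if c.intended_action == "summarize_ephemeral" and c.license == "cc-by":
        return Decision.ALLOW     # caller is responsible for attribution logging
    return Decision.REVIEW        # assumption: anything unmatched goes to a human


bronze = Candidate("10.1234/example", "bronze", None, "publisher", True, "embed")
print(decide(bronze).value)
```

Running `decide` once before fetch (with the intended action) and again before retain matches the two evaluation points described above; the second call can use a different `intended_action` for the same document.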

Risks and Counterarguments

“OA means we can scrape freely.” Bronze and green complicate this. Free to read ≠ free to bulk-ingest for commercial AI.

“We’re just like the browser extension.” Scale, automation, storage, and product embedding change the risk profile. Equivalence is an engineering claim, not a legal defense.

“The API gave us a URL, so we’re covered.” The API gives a location, not a use license for your product.

Practical Checklist

  • Document prohibited sources (toll-access, credential-gated, known infringing mirrors)
  • Implement robots.txt and ToS review for automated fetch
  • Separate “read for user session” from “retain for RAG/training”
  • Default-deny bronze and unlicensed copies for automated reuse
  • Legal review for cross-border teams (the EU CDSM Directive's text-and-data-mining exceptions differ from the UK exception, which covers non-commercial research only)
  • Audit agent volume and domain spread quarterly
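The last checklist item can start from the agent's own fetch log. The sketch below assumes the log is available as a list of fetched URLs; `audit_fetch_log` and the per-domain threshold are hypothetical, and a real audit would also want timestamps and workflow identifiers.

```python
from collections import Counter
from urllib.parse import urlsplit


def audit_fetch_log(urls, max_per_domain):
    """Summarize fetch volume and domain spread for a review period."""
    per_domain = Counter(urlsplit(u).hostname or "unknown" for u in urls)
    return {
        "total_fetches": sum(per_domain.values()),
        "distinct_domains": len(per_domain),
        "over_threshold": sorted(
            domain for domain, count in per_domain.items() if count > max_per_domain
        ),
    }


# Invented sample log for illustration.
log = [
    "https://repository.example.edu/a.pdf",
    "https://repository.example.edu/b.pdf",
    "https://repository.example.edu/c.pdf",
    "https://preprints.example.org/x.pdf",
]
print(audit_fetch_log(log, max_per_domain=2))
```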

Conclusion

The compliance boundary is behavioral and contextual, not merely technical. A clean Unpaywall API lookup is necessary but never sufficient. On the other side of the access question, creators and publishers must also ask whether open access and AI economics actually pay those who produce the work, a tension this boundary analysis does not resolve on its own.




Relevant Sources

  1. Browser Extension (Unpaywall). OurResearch. https://unpaywall.org/products/extension
  2. Link Resolver Integrations. Unpaywall Support. https://support.unpaywall.org/support/solutions/articles/44001874811-link-resolver-integrations
  3. Integrations (Unpaywall). OurResearch. https://unpaywall.org/integrations
  4. Copyright and Open Licenses. openaccess.nl. https://www.openaccess.nl/en/publishing/copyright-and-open-licenses
  5. Usage Policy Update. Anthropic. https://www.anthropic.com/news/usage-policy-update
  6. Generative Artificial Intelligence and Copyright Law. U.S. Congress CRS. https://www.congress.gov/crs-product/LSB10922 (jurisdiction-specific)
  7. REST API (Unpaywall). OurResearch. https://unpaywall.org/products/api
  8. Unpaywall Support FAQ. Unpaywall. https://support.unpaywall.org/support/solutions/folders/44000384007
