Can Claude Use Unpaywall Legally? Access, Discovery, and the First Compliance Question

Eran Goldman-Malka · June 2, 2026

When teams wire Claude or another AI assistant into scholarly research workflows, the first technical question is usually practical: can we use Unpaywall to find full-text papers? The first compliance question is harder: does discovery through an open-access index grant permission to copy, store, summarize, or commercialize what we retrieve? Unpaywall can lawfully help locate publisher-authorized or repository-hosted open-access copies of scholarly articles, but using that discovery in an AI retrieval workflow still leaves separate, and often stricter, questions about copying, terms of service, licensing, and downstream reuse.

Two Sides of the Same Pipeline

Every scholarly retrieval workflow has two stakeholders whose incentives do not automatically align.

Side A (the AI user or builder): Can a system like Claude lawfully use Unpaywall-style open-access discovery to reach content?

Side B (the creator, publisher, or rights holder): Even when access is legal, can you be confident that you are paid, remain compliant, and stay protected when AI systems ingest or repackage your work?

Both sides deserve equal seriousness. This article does not frame the topic as paywall bypassing. Unpaywall is an open-access discovery service, not a piracy tool. Throughout, we distinguish facts (what a system does), interpretation (how law or policy may apply), and risk analysis (what can go wrong operationally).

The focus here is mechanics and first-order questions around discovery and initial retrieval—not model training, bulk redistribution, or commercial exploitation.

What Unpaywall Actually Is (and Is Not)

Fact: Unpaywall is an open database operated by the nonprofit OurResearch. It indexes open-access locations for scholarly articles identified by Crossref Digital Object Identifiers (DOIs), drawing on data from tens of thousands of publishers and repositories.

Fact: When the browser extension detects a DOI on a page, it queries the Unpaywall API to retrieve the best known open-access location for that article (FAQ). The API returns structured JSON including is_oa, oa_status, and a best_oa_location object with fields such as license, host_type, and URLs (data format).
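To make those fields concrete, here is a minimal Python sketch that reads a response of the shape described above. The sample payload is hand-written for illustration (the DOI, URL, and values are invented, not real API output), and summarize_record is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like the fields named above; all values are invented.
SAMPLE_RESPONSE = """
{
  "doi": "10.1234/example.5678",
  "is_oa": true,
  "oa_status": "hybrid",
  "best_oa_location": {
    "url": "https://publisher.example/article.pdf",
    "host_type": "publisher",
    "license": "cc-by"
  }
}
"""


def summarize_record(raw: str) -> dict:
    """Pull out the discovery fields a retrieval pipeline typically needs."""
    record = json.loads(raw)
    # best_oa_location can be null when no OA copy is indexed.
    location = record.get("best_oa_location") or {}
    return {
        "doi": record.get("doi"),
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        "license": location.get("license"),
        "host_type": location.get("host_type"),
        "url": location.get("url"),
    }


summary = summarize_record(SAMPLE_RESPONSE)
print(summary)
```

Keeping license and host_type next to is_oa in the same record makes it harder for later pipeline stages to act on the boolean alone.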

Fact: Unpaywall does not crack paywalls. It points to copies that publishers or repositories have made available. If no open-access location exists, integrations may redirect to the toll-access publisher page—behavior that link-resolver documentation describes as working correctly (link resolver integrations).

Interpretation: Describing Unpaywall as a “piracy tool” mischaracterizes its stated function. Libraries, discovery systems, and link resolvers worldwide integrate it as legitimate scholarly infrastructure (integrations). That said, misuse of URLs returned by Unpaywall—for example, unattended bulk scraping that violates publisher terms—is a separate compliance problem from discovery itself.

Open-Access Status Is Not a Single Permission

Unpaywall assigns each article an oa_status: gold, green, hybrid, bronze, or closed (oa_status definitions). These labels describe where and how openness was achieved—not a blanket grant of rights for every downstream use.

| Status | Plain-language meaning | Compliance note |
| --- | --- | --- |
| Gold | Published in a fully OA journal | Often carries an explicit license (e.g., CC BY) |
| Hybrid | OA article in an otherwise toll-access journal | License usually present on the OA copy |
| Green | Archived in a repository | Version and license may differ from the publisher VoR |
| Bronze | Free to read on the publisher site | The license field may be null, so reuse rights are unclear |
| Closed | No OA location indexed | Discovery may still return the publisher toll page |

Risk: Treating is_oa: true as permission for AI ingestion, summarization-for-resale, embedding in a commercial RAG product, or model training without reading the license is one of the most common compliance failures we see in design reviews.

SPARC defines open access as free, immediate, online availability plus the rights to use articles fully in the digital environment (SPARC Open Access). That definition makes clear that OA is not merely “free to read”—but the extent of reuse rights still depends on the specific license attached to each copy.

How AI Retrieval Differs from Clicking the Green Tab

The Unpaywall browser extension is a human-in-the-loop pattern: a researcher on a publisher page clicks a tab when a legal OA copy exists (extension). An AI agent workflow is structurally different:

  1. User query or task trigger
  2. DOI resolution (from citation, Crossref, or metadata)
  3. Unpaywall API call (GET /v2/{doi}?email=...)
  4. HTTP fetch of the returned OA URL
  5. PDF/HTML parse, chunk, embed, or summarize
  6. Optional storage in vector DB, cache, or logs
  7. Output to user or downstream product
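Step 3 can be sketched as a small URL builder. This is an illustration under assumptions: the host api.unpaywall.org is taken from Unpaywall's public documentation and should be confirmed there, and build_lookup_url, the example DOI, and the example email are all hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed base URL for the documented GET /v2/{doi}?email=... endpoint.
API_BASE = "https://api.unpaywall.org/v2/"

# Common resolver prefixes that are not part of the DOI itself.
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "https://dx.doi.org/", "doi:")


def build_lookup_url(doi: str, email: str) -> str:
    """Return the lookup URL for one DOI, rejecting obviously malformed input."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    if "@" not in email:
        raise ValueError("a contact email is required")
    # Keep the slash between DOI prefix and suffix; percent-encode everything else.
    return API_BASE + quote(doi, safe="/") + "?" + urlencode({"email": email})


print(build_lookup_url("https://doi.org/10.1234/example.5678", "research@example.org"))
```

Validating the DOI and email before any network call keeps malformed agent output from turning into wasted or abusive requests.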

Each step can invoke different obligations: API terms, copyright reproduction, database rights (jurisdiction-specific), site terms of service, and license conditions on reuse.

Provenance requirement: At fetch time, log the DOI, oa_status, license string, host_type, URL fetched, timestamp, and document version where identifiable. Without this record, later audit—regulatory, contractual, or internal—is guesswork.
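One way to capture that record is a small immutable structure written out as a JSON log line. This is a sketch, not a prescribed schema: FetchProvenance, make_provenance, and to_log_line are hypothetical names, and reading the document version from a version key on the location object is an assumption to check against the data format documentation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FetchProvenance:
    doi: str
    oa_status: str
    license: Optional[str]    # None when the index reports no license
    host_type: Optional[str]
    url: str
    fetched_at: str           # ISO 8601 timestamp in UTC
    version: Optional[str]    # None when the version is not identifiable


def make_provenance(doi: str, oa_status: str, location: dict,
                    now: Optional[datetime] = None) -> FetchProvenance:
    """Build the audit record at fetch time from one OA location dict."""
    now = now or datetime.now(timezone.utc)
    return FetchProvenance(
        doi=doi,
        oa_status=oa_status,
        license=location.get("license"),
        host_type=location.get("host_type"),
        url=location["url"],  # a fetch without a URL is a bug, so fail loudly
        fetched_at=now.isoformat(),
        version=location.get("version"),  # assumed key name
    )


def to_log_line(record: FetchProvenance) -> str:
    """Serialize with sorted keys so log lines are stable and easy to diff."""
    return json.dumps(asdict(record), sort_keys=True)


example = make_provenance(
    "10.1234/example.5678",
    "bronze",
    {"url": "https://publisher.example/a.pdf", "host_type": "publisher", "license": None},
)
print(to_log_line(example))
```

Recording a null license explicitly, rather than omitting the field, preserves the fact that reuse rights were unknown at fetch time.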

Three Compliance Layers: Discovery, Access, Reuse

Think of compliance as three stacked layers, not one switch:

Layer 1 — Discovery compliance: Are you using the Unpaywall API as intended? Requests require a valid email parameter; the documented limit is 100,000 calls per day (REST API). High-volume or commercial deployments should contact OurResearch about data feeds rather than assuming anonymous bulk access is appropriate. Agent loops that hammer the API invite HTTP 429 responses and reputational harm.
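A client-side guard can keep an agent loop inside that limit and slow it down after an HTTP 429. The sketch below is illustrative only: DailyBudget and backoff_seconds are hypothetical helpers, the calendar-day reset is an assumption about how a day is counted, and the backoff constants are arbitrary.

```python
from datetime import date
from typing import Optional


class DailyBudget:
    """Counts calls per calendar day and refuses once the limit is reached."""

    def __init__(self, limit: int = 100_000) -> None:
        self.limit = limit
        self._day: Optional[date] = None
        self._used = 0

    def try_spend(self, today: date) -> bool:
        """Return True and count one call if budget remains for today."""
        if today != self._day:
            # New day: reset the counter (assumed reset rule).
            self._day = today
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay to apply after the Nth consecutive 429 response."""
    return min(cap, base * (2 ** attempt))


budget = DailyBudget(limit=2)
day = date(2026, 6, 2)
print([budget.try_spend(day) for _ in range(3)])
```

Checking the budget before each request means the agent stops on its own instead of relying on the server to reject it.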

Layer 2 — Access compliance: Is the copy at the returned URL lawfully available for your fetch? For hybrid or gold OA with a clear license, access is typically uncontroversial. For bronze OA or green repository copies, access may be lawful while reuse remains ambiguous.

Layer 3 — Reuse compliance: Does your intended use—summarize, store, adapt, train, redistribute—fall within the license, publisher policy, and applicable law? Access does not imply reuse rights. A CC BY article permits broad reuse with attribution (CC BY 4.0); a bronze article with no license metadata does not give you a safe default.
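The three layers can be expressed as an ordered, default-deny check that reports the first layer a request fails. This is a sketch under stated assumptions: RetrievalRequest, first_failing_layer, and the APPROVED_USES table are hypothetical, the table stands in for whatever your own legal review has approved, and the access test is only a crude proxy because lawfulness of a fetch cannot be computed from metadata.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy table: uses your own legal review has approved per
# license string. Anything not listed here is denied.
APPROVED_USES = {
    "cc-by": {"read", "summarize", "store", "adapt", "redistribute"},
}


@dataclass
class RetrievalRequest:
    contact_email: Optional[str]
    within_rate_limit: bool
    oa_url: Optional[str]
    license: Optional[str]
    intended_use: str


def first_failing_layer(req: RetrievalRequest) -> Optional[str]:
    """Return 'discovery', 'access', or 'reuse' for the first failed layer, else None."""
    # Layer 1: API used as intended (contact email present, within limits).
    if not req.contact_email or not req.within_rate_limit:
        return "discovery"
    # Layer 2: crude proxy only; a URL existing does not prove the fetch is lawful.
    if not req.oa_url:
        return "access"
    # Layer 3: the intended use must be explicitly approved for this license.
    approved = APPROVED_USES.get(req.license or "", set())
    if req.intended_use not in approved:
        return "reuse"
    return None


bronze = RetrievalRequest("research@example.org", True,
                          "https://publisher.example/a.pdf", None, "summarize")
print(first_failing_layer(bronze))
```

Because a missing license maps to an empty set of approved uses, a bronze copy with no license metadata stops at the reuse layer unless someone adds an explicit approval.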

Where Claude and Anthropic Fit

Anthropic’s legal and compliance documentation makes an important distinction: your relationship with the AI vendor is governed by Commercial Terms (API, Team, Enterprise) or Consumer Terms (Free, Pro, Max). These terms regulate data retention, training on your inputs, and acceptable use—they do not grant copyright permissions for third-party scholarly works.

Fact: Under Commercial Terms, Anthropic does not use customer API content for model training by default. Consumer tiers have different retention and opt-in training settings (consumer terms update).

Interpretation: Choosing the right tier is a compliance control for your organization’s data, not a substitute for publisher licenses. The Usage Policy also restricts certain high-risk uses and malicious automation; it does not replace Creative Commons or publisher terms.

Can Claude Use Unpaywall Legally? A Nuanced Answer

There is no universal yes/no answer—outcomes depend on use case, license, jurisdiction, and scale.

Likely lower-risk pattern (interpretation, not legal conclusion): A human-supervised research assistant on Commercial/API terms queries Unpaywall for a known DOI, fetches a hybrid-OA article licensed CC BY, summarizes it with proper attribution for an internal literature review, and does not retain full text beyond the session—provided publisher and repository terms of service do not prohibit the automated fetch.

Higher-risk patterns:

  • Unattended agents bulk-downloading bronze OA (license: null) for commercial RAG
  • Combining Unpaywall metadata with institutional proxy credentials to fetch toll-access versions
  • Feeding full text into consumer-tier Claude where retention and training settings may not meet enterprise confidentiality requirements
  • Inferring training rights from is_oa alone

Uncertain and jurisdiction-specific: Whether transient copies made solely for summarization qualify as fair use or fair dealing; how the EU CDSM text-and-data-mining exceptions apply; and how far the UK's narrow TDM exception, which is limited to non-commercial research, reaches. The U.S. Copyright Office AI initiative is actively examining training and reuse questions but does not resolve retrieval-for-summarization in every scenario.

Real-World Example

A biotech startup wires a Claude agent to Unpaywall for competitive intelligence. The agent fetches a hybrid-OA oncology paper tagged CC BY in Unpaywall’s API response. Discovery is consistent with Unpaywall’s design. Access to the publisher-hosted OA PDF is likely lawful. Reuse becomes the crux: embedding chunks in a commercial product requires CC BY attribution compliance, human oversight for medical claims, and a decision about whether persistent vector storage counts as “adaptation” under the license. None of that is settled by the green tab alone.
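For the attribution part of that scenario, one possible approach is to attach the credit line to every stored chunk so it travels with the text into the vector store. The sketch below assumes a CC BY 4.0 article; AttributedChunk, build_attribution, and chunk_with_attribution are hypothetical names, the title, authors, and URL are invented, and whether this form of credit satisfies the license for a given product remains a question for legal review.

```python
from dataclasses import dataclass
from typing import List

CC_BY_4_URL = "https://creativecommons.org/licenses/by/4.0/"


@dataclass(frozen=True)
class AttributedChunk:
    text: str
    attribution: str  # stored with the chunk so it can be shown with any output


def build_attribution(title: str, authors: List[str], source_url: str,
                      modified: bool) -> str:
    """Credit line naming title, authors, source, and license, plus a change note."""
    note = " Text was excerpted and chunked." if modified else ""
    return (f'"{title}" by {", ".join(authors)}, {source_url}, '
            f"licensed under CC BY 4.0 ({CC_BY_4_URL}).{note}")


def chunk_with_attribution(text: str, size: int, attribution: str) -> List[AttributedChunk]:
    """Split text into fixed-size pieces, each carrying the same credit line."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [AttributedChunk(text[i:i + size], attribution)
            for i in range(0, len(text), size)]


credit = build_attribution("An Example Oncology Study", ["A. Author", "B. Author"],
                           "https://publisher.example/article", modified=True)
chunks = chunk_with_attribution("abcdefghij", 4, credit)
print(len(chunks), chunks[0].attribution)
```

Storing the credit per chunk rather than per document means a retrieved passage can always be displayed with its source, even when it is surfaced far from the original record.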

Risks and Counterarguments

“Unpaywall says it’s legal, so our pipeline is legal.” Unpaywall locates OA copies authorized by publishers and repositories. It does not warrant your use case, jurisdiction, volume, or downstream product.

“If it’s on the web, Claude can read it.” Visibility is not permission. Robots.txt, terms of service, and license terms may restrict automated access even when a human could read the same page for free.

Conflating extension UX with headless scraping. The extension operates at human scale with single-article intent. Agents operating at corpus scale change the risk profile even when the underlying OA copies are the same.

Practical Checklist: Side A (Builders)

  • Confirm API use within rate limits with a valid contact email
  • Capture oa_status, license, and host_type for every retrieved document
  • Map intended use (read, summarize, store, train, redistribute) against license text
  • Use Commercial/API terms for enterprise retrieval agents
  • Default-deny pipeline progression when license is null (bronze) unless legal review approves
  • Document human oversight for summaries used in regulated or high-stakes domains
  • Never infer training or commercial reuse rights from is_oa alone

Conclusion

Legal open-access discovery is a legitimate scholarly infrastructure pattern, and for many research workflows Unpaywall is an appropriate starting point. But discovery is the start of compliance, not the end. Whether a pipeline stays on the lawful side of the line between OA access and circumvention, scraping, or terms-of-service violations depends on scale, automation, and license scope as much as on whether a copy was discoverable in the first place.


Designing agentic research workflows? If you need a compliance architecture review—discovery, provenance, data retention, and license-aware retrieval—contact me directly to discuss a focused engagement.


Relevant Sources

  1. FAQ Unpaywall — OurResearch — https://unpaywall.org/faq — Defines extension/API discovery model and DOI-based lookup.
  2. REST API Unpaywall — OurResearch — https://unpaywall.org/products/api — API requirements, email parameter, rate limits.
  3. Data Format Unpaywall — OurResearch — https://unpaywall.org/data-format — Schema for license, host_type, OA locations.
  4. OA status definitions — Unpaywall Support — https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean- — Permission granularity by status type.
  5. Legal and compliance (Claude Code) — Anthropic — https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance — Commercial vs consumer terms, usage policy linkage.
  6. Open Access — SPARC — https://sparcopen.org/open-access/ — Authoritative OA definition including reuse rights.
  7. Deed: CC BY 4.0 — Creative Commons — https://creativecommons.org/licenses/by/4.0/ — Standard license conditions for many OA articles.
  8. Copyright and Artificial Intelligence — U.S. Copyright Office — https://copyright.gov/ai/ — U.S. federal framing for AI/copyright intersection (jurisdiction-specific).
