A Two-Sided Compliance Checklist: Governing AI Access to Scholarly Content

Eran Goldman-Malka · June 26, 2026

AI-assisted access to scholarly content creates a two-sided compliance problem. On one side, systems like Claude can use Unpaywall-style open-access discovery to reach legal full-text copies. On the other, creators, publishers, and rights holders must ask whether they are paid, protected, and compliant when AI retrieves and reuses their work. This article brings those threads together in shared vocabulary, paired checklists, a maturity model, and a priority action sequence—without treating OA discovery as permission for unchecked exploitation, or treating every retrieval tool as piracy.

The Through-Line

Access ≠ reuse rights. Unpaywall is an OA discovery service, not a paywall bypass. Compliance lives in licenses, contracts, platform terms, data protection, and governance, not only in the green tab that the Unpaywall browser extension shows when it finds a free copy. Discovery, access, and reuse are separate layers; choosing an AI service tier and obtaining vendor disclosures are separate again from copyright permission.

Shared Vocabulary

| Term | Meaning |
| --- | --- |
| Discovery | Finding a legal OA location (e.g., Unpaywall API for a DOI) |
| Access | Fetching bytes from that location |
| Reuse | Summarize, store, embed, train, redistribute |
| Compliance boundary | Where automation, scale, license, or ToS intervene |
| Bronze OA | Free to read; license metadata often absent |
| Provenance | Record of DOI, license, URL, tier, purpose, timestamp |

Use this vocabulary in RACI charts, data protection impact assessments (DPIAs), and vendor questionnaires so that legal, library, security, and engineering teams are not talking past each other.

Side A Checklist: AI Users, Builders, and Vendors

Discovery

  • Use Unpaywall API with valid email and within documented rate limits (REST API)
  • Contact OurResearch for commercial/high-volume data feed needs
  • Do not route discovery through infringing mirrors or stolen credentials
  • Treat closed status as no OA location—not a signal to bypass
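
As a rough illustration of the discovery items above, the sketch below builds an Unpaywall v2 lookup URL and reduces a response to the fields that later steps need. The function names, the sample DOI, and the timeout are assumptions made for this example; the response fields (is_oa, oa_status, best_oa_location, license, host_type) follow Unpaywall's published data format, which should be checked against the current documentation before relying on it.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def build_lookup_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; Unpaywall asks for a contact email."""
    if "@" not in email:
        raise ValueError("a valid contact email is required")
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?email={quote(email)}"


def fetch_record(doi: str, email: str, timeout_s: float = 10.0) -> dict:
    """Fetch one record (network call; rate limiting is left to the caller)."""
    with urlopen(build_lookup_url(doi, email), timeout=timeout_s) as resp:
        return json.load(resp)


def pick_oa_location(record: dict) -> Optional[dict]:
    """Return discovery facts for an OA record, or None when it is closed."""
    best = record.get("best_oa_location")
    if not record.get("is_oa") or not best:
        # Closed means "no OA location", not "look for a workaround".
        return None
    return {
        "doi": record.get("doi"),
        "oa_status": record.get("oa_status"),
        "url": best.get("url"),
        "license": best.get("license"),
        "host_type": best.get("host_type"),
    }
```

A caller would pass the result of fetch_record to pick_oa_location and stop when it returns None, rather than falling back to any other source.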

Access

  • Fetch only from returned OA URLs or separately authorized sources
  • Respect robots.txt and publisher terms for automated access
  • Do not inject institutional proxy credentials into agent fetchers
  • Rate-limit domain fetches; avoid overnight corpus harvesting without review
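
One way to make the access items above concrete is a small gate that an agent fetcher consults before every request. The sketch uses Python's standard urllib.robotparser; the class name PoliteFetchGate, the five-second default interval, and the choice to refuse hosts whose robots.txt has not been loaded are all assumptions for illustration. Publisher terms still need human review, which no robots.txt check replaces.

```python
import time
from typing import Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteFetchGate:
    """Allow a fetch only if robots.txt permits it and the host is not being hit too often."""

    def __init__(self, user_agent: str, min_interval_s: float = 5.0):
        self.user_agent = user_agent
        self.min_interval_s = min_interval_s
        self._robots = {}      # host -> parsed robots.txt rules
        self._last_fetch = {}  # host -> time of the last allowed fetch

    def load_robots(self, host: str, robots_txt: str) -> None:
        """Register the robots.txt body that was retrieved for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def may_fetch(self, url: str, now: Optional[float] = None) -> bool:
        host = urlparse(url).netloc
        parser = self._robots.get(host)
        # Unknown host or disallowed path: refuse rather than guess.
        if parser is None or not parser.can_fetch(self.user_agent, url):
            return False
        now = time.monotonic() if now is None else now
        last = self._last_fetch.get(host)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon for this domain
        self._last_fetch[host] = now
        return True
```

Refusing unknown hosts by default keeps the fetcher restricted to locations that discovery returned and that someone has reviewed.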

Reuse

  • Capture oa_status, license, and host_type before retain/embed
  • Default-deny bronze (license: null) for automated commercial reuse
  • Enforce CC BY-NC and BY-ND rules in pipeline gates
  • Map actions: read / summarize / store / embed / train / resell
  • Never infer training rights from is_oa: true
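
The reuse rules above can be expressed as a pipeline gate. The following is a minimal sketch, not legal advice: the three-way decision, the rule ordering, and the treatment of no-derivatives licenses as "send to review" are assumptions chosen for illustration, and the lowercase license strings (cc-by, cc-by-nc, and so on) are assumed to match the values the discovery step records.

```python
from enum import Enum
from typing import Optional

ACTIONS = {"read", "summarize", "store", "embed", "train", "resell"}


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # route to a human (legal or library) before proceeding
    DENY = "deny"


def reuse_gate(action: str, license_id: Optional[str], commercial: bool) -> Decision:
    """Decide whether an automated reuse action may proceed for one article."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "read":
        return Decision.ALLOW  # lawful access was already established upstream
    if license_id is None:
        # Bronze or unknown license: default-deny commercial reuse.
        return Decision.DENY if commercial else Decision.REVIEW
    parts = license_id.lower().split("-")
    if parts[0] not in {"cc", "cc0", "pd"}:
        return Decision.REVIEW  # publisher-specific or implied terms
    if "nc" in parts and commercial:
        return Decision.DENY
    if "nd" in parts or action in {"train", "resell"}:
        # Training and resale are never inferred from OA status alone.
        return Decision.REVIEW
    return Decision.ALLOW
```

For example, reuse_gate("embed", None, commercial=True) denies a bronze article, while reuse_gate("train", "cc-by", commercial=True) still goes to review instead of being allowed automatically.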

Platform and vendor

  • Mandate Commercial/API tier for enterprise scholarly workflows
  • Activate zero data retention (ZDR), plus a business associate agreement (BAA) if protected health information (PHI) is involved, where required (legal/compliance)
  • Complete vendor transparency questionnaire (retention, cache, subprocessors)
  • Contractually prohibit unlicensed subscription PDF ingestion

Data protection

  • Classify scholarly PDFs for personal data before upload
  • Complete a DPIA for retrieval-augmented generation (RAG) over external literature
  • Block consumer-tier shadow AI for work research

Logging and incidents

  • Log user/service, DOI, license, URL, tier, purpose, timestamp
  • Maintain playbooks for wrongful ingestion and NC violations
  • Reconstruct sessions during investigations—not only model outputs
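
The logging fields listed above map naturally onto a small record type written as one JSON line per retrieval. This is a sketch under stated assumptions: the class and function names, the JSON Lines format, and the UTC ISO 8601 timestamp are illustrative choices, not a required schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    actor: str                 # user or service account that triggered the fetch
    doi: str
    license: Optional[str]     # None when no license metadata was found
    url: str                   # the OA location actually fetched
    tier: str                  # AI service tier in use, e.g. "api"
    purpose: str               # read / summarize / store / embed / ...
    timestamp: str             # ISO 8601, UTC


def new_record(actor: str, doi: str, license_id: Optional[str], url: str,
               tier: str, purpose: str,
               now: Optional[datetime] = None) -> ProvenanceRecord:
    """Create a record, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return ProvenanceRecord(actor, doi, license_id, url, tier, purpose,
                            now.isoformat())


def to_log_line(record: ProvenanceRecord) -> str:
    """Serialize with stable key order so log lines diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping a null license in the log, instead of dropping the field, lets an investigator see later that reuse happened without license metadata.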

Side B Checklist: Creators, Publishers, and Rights Holders

License clarity

  • Attach explicit license to every OA article; minimize bronze ambiguity
  • Embed machine-readable license metadata in HTML and PDF
  • Publish AI/text-mining policy separate from OA reader access
  • Understand Unpaywall oa_status for your portfolio (definitions)
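
To see how a portfolio looks from the discovery side, a publisher could tally oa_status across its DOIs and list OA articles that carry no license metadata. The sketch below assumes the records are dictionaries shaped like Unpaywall responses that have already been retrieved; the function name and the "unknown" bucket are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize_portfolio(records: Iterable[dict]) -> Tuple[Dict[str, int], List[str]]:
    """Count articles per oa_status and list OA DOIs with no license recorded."""
    status_counts = Counter()
    missing_license = []
    for rec in records:
        status_counts[rec.get("oa_status") or "unknown"] += 1
        best = rec.get("best_oa_location") or {}
        if rec.get("is_oa") and not best.get("license"):
            # Free to read but license-ambiguous: a candidate for cleanup.
            missing_license.append(rec.get("doi"))
    return dict(status_counts), missing_license
```

The second return value is a worklist: each DOI on it is readable by anyone but gives downstream pipelines no basis for a reuse decision.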

Payment and author rights

  • Know the OA model (gold/green/hybrid/bronze) and who paid the article processing charge (APC)
  • Use SPARC or institutional addenda where negotiable (author rights)
  • Align license choice (BY vs BY-NC) with funder policy and AI risk tolerance
  • Do not assume that paying an APC bought protection against license-compliant commercial RAG

Monitoring and enforcement

  • Monitor bulk access and anomalous fetch patterns
  • Participate in or offer licensing markets for AI and text and data mining (TDM) (Copyright Office AI hub)
  • Audit attribution in major AI products where feasible
  • Coordinate metadata accuracy with discovery partners
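
Monitoring for bulk access can start with something as simple as a sliding-window request counter per client. The sketch below is one possible approach, with the class name, the 60-second window, and the 30-request threshold all chosen arbitrarily for illustration; real thresholds depend on the platform's normal traffic.

```python
from collections import defaultdict, deque


class FetchRateMonitor:
    """Flag clients whose request count in a sliding time window exceeds a limit."""

    def __init__(self, window_s: float = 60.0, max_requests: int = 30):
        self.window_s = window_s
        self.max_requests = max_requests
        self._events = defaultdict(deque)  # client id -> recent request times

    def record(self, client_id: str, ts: float) -> bool:
        """Record one request at time ts; return True if the client looks anomalous."""
        events = self._events[client_id]
        events.append(ts)
        # Drop requests that have aged out of the window.
        while events and ts - events[0] > self.window_s:
            events.popleft()
        return len(events) > self.max_requests
```

A flag from this monitor is a prompt for review, not proof of misuse; legitimate TDM under a license can look identical in the logs.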

Policy advocacy

  • Distinguish OA discovery from piracy in internal training
  • Engage funders on whether CC BY should remain mandatory for all disciplines
  • Track jurisdiction-specific TDM and AI rules (EU, UK, U.S.)

Joint Governance Model

Sustainable AI-assisted scholarship requires a cross-functional forum—not siloed tool adoption.

Participants: legal, library/scholarly communication, IT/security, research office, compliance/privacy, engineering.

RACI example for literature retrieval agents:

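| Activity | Legal | Library | Security | Engineering |
| --- | --- | --- | --- | --- |
| Approve OA discovery use | C | R | I | A |
| License-aware pipeline design | C | C | I | R/A |
| Tier and ZDR configuration | I | I | R/A | C |
| Provenance logging | I | C | R | A |
| Incident response | R/A | C | R | C |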

Cadence: annual review of OA/AI policy; trigger review on vendor term changes (consumer terms, usage policy).

Maturity Model

| Tier | Label | Indicators |
| --- | --- | --- |
| 0 | Ad hoc | Paste PDFs into consumer chat; no provenance |
| 1 | Aware | Staff trained on OA vs paywall; no automation |
| 2 | Instrumented | Provenance logs; commercial tier enforced |
| 3 | Governed | License-aware pipelines; DPIA; vendor due diligence; library alignment |
| 4 | Optimized | Continuous audit; publisher/funder policy alignment; incident metrics |

Most organizations discovering AI literature tools in 2026 sit at Tier 0 or 1. Reaching Tier 3 before scaling agents is cheaper than retrofitting after a copyright or data-protection incident.

Priority Action Sequence

If you need a pragmatic order of operations:

  1. Inventory all AI scholarly workflows (approved and shadow)
  2. Enforce Commercial/API tier and ZDR where confidential or regulated data may appear
  3. Instrument Unpaywall/license metadata in every retrieval path
  4. Default-deny bronze and null-license automated reuse pending legal review
  5. Publish creator/publisher AI reuse guidance (authors, journals, library)
  6. Deploy vendor transparency questionnaire and contract updates

Real-World Example

A research university starts at Tier 0: faculty paste PDFs into consumer Claude. Over two quarters they move to Tier 3: library-mediated OA resolver documentation, API-tier Claude for approved projects, provenance fields in a pilot RAG for grant teams, and a SPARC addendum campaign for junior faculty. Discovery stays on Unpaywall and licensed databases—not infringing sites. Compliance becomes demonstrable in funder audits. Neither side of the OA/AI debate “wins”; governance connects them.

What Both Sides Can Agree On

  • OA discovery infrastructure is legitimate scholarly tooling (Unpaywall)
  • Piracy and credential abuse are out of bounds
  • Licenses matter after access
  • Transparency beats assumptions
  • Education beats conclusory legal slogans

Where they disagree—CC BY vs commercial AI, training markets, bronze policy—belongs in policy and licensing forums, not in covert retrieval workarounds.

Conclusion

Both sides can be “right” at their layer and still produce organizational failure if nobody connects discovery to payment to protection. Unpaywall answers “where is a legal OA copy?” It does not answer “may we embed, sell, or train?” Claude’s terms answer “how may we use this API?” They do not answer “may we use this paper?” Creators answer “what did we license?” That answer alone does not control deployer behavior.

Governance is the connective tissue. Use the checklists above as living documents—versioned, owned, and reviewed when terms, licenses, or features change.


Ready to move from ad hoc AI literature use to governed retrieval? Contact me for a two-sided compliance assessment—discovery architecture, provenance design, tier and DPIA review, and a prioritized roadmap for your organization.


Relevant Sources

  1. FAQ Unpaywall — OurResearch — https://unpaywall.org/faq
  2. REST API Unpaywall — OurResearch — https://unpaywall.org/products/api
  3. Legal and compliance — Claude Code — Anthropic — https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance
  4. Open Access — SPARC — https://sparcopen.org/open-access/
  5. Copyright and AI hub — U.S. Copyright Office — https://copyright.gov/ai/
  6. Regulation (EU) 2024/1689 — EUR-Lex — https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689
  7. Anthropic Transparency Hub — Anthropic — https://www.anthropic.com/transparency/system-trust-reporting
  8. CC BY 4.0 — Creative Commons — https://creativecommons.org/licenses/by/4.0/
