What Should AI Vendors Disclose? Provenance, Training Data, and the Transparency Deficit

Eran Goldman-Malka · June 24, 2026

Users and rightsholders cannot govern what they cannot see. When AI systems retrieve scholarly content through open-access discovery, summarize PDFs, or cache web fetches, compliance depends on disclosure: what was retrieved, from where, under which license, retained for how long, and whether it could enter training pipelines.

Transparency Means Different Things for Retrieval and Training

A common category error collapses “AI transparency” into one bucket. For scholarly compliance, at least two pipelines matter:

Inference-time retrieval: browsing, API calls to Unpaywall, PDF fetches, session caching for RAG. Questions: Does the vendor store fetched PDFs? For how long? Are subprocessors involved? Can deployers get DOI-level provenance in logs?

Training-time ingestion: use of copyrighted or OA corpora in foundation model pretraining or fine-tuning. Questions: What source classes were included? Were licenses respected? Can rightsholders verify opt-outs?

Fact: Under Anthropic's commercial terms, customer API inputs are not used for training by default (consumer terms differ). That covers customer content sent to the API, not the pretraining corpus of the base model, about which public detail remains limited.

Deployers must not confuse “we don’t train on your API data” with “your retrieval workflow is copyright-cleared.”

What Anthropic Discloses Today

Anthropic’s Transparency Hub publishes usage policy enforcement patterns, legal process handling, and safety reporting. The legal and compliance documentation covers tier selection, business associate agreements (BAA) and Zero Data Retention (ZDR) for healthcare, and Usage Policy applicability.

Fact: Under Zero Data Retention, API inputs and outputs are not stored at rest after the response is returned, with exceptions for legal and abuse-monitoring requirements.

Fact: Usage Policy updates clarify high-risk use requirements and prohibited malicious automation.

Gap (interpretation): Public documentation does not fully specify whether built-in research or browsing features cache third-party OA PDFs server-side, what subprocessors touch fetched scholarly content, or default provenance metadata returned to deployers. Procurement teams should ask directly and contractually—not infer from marketing.

Regulatory Expectations: EU AI Act and Beyond

Jurisdiction-specific: Regulation (EU) 2024/1689 (EUR-Lex) imposes transparency obligations on providers of high-risk AI systems, including instructions for use that enable deployers to interpret outputs (Article 13; see the AI Act Service Desk). The regulation also addresses general-purpose AI model transparency in Chapter V. Applicability depends on how your system is classified and on your role (provider vs deployer).

U.S.: The Copyright Office Part 3 report encourages development of licensing markets and transparency around training data (copyright.gov/ai)—policy direction, not a vendor disclosure standard.

Deployers operating globally should map features to roles under the AI Act and parallel emerging frameworks, with counsel.

Minimum Viable Transparency Framework (Proposed)

The following is an educational framework, not a legal standard. It lists disclosures procurement and compliance teams should request from any AI vendor involved in scholarly workflows:

  1. Source classes — open web, licensed partners, user uploads, OA APIs (Unpaywall named if used)
  2. Fetch behavior — real-time only vs server-side cache; TTL; geographic storage
  3. Retention and training — by tier; ZDR availability; opt-in training on consumer plans
  4. Provenance to deployer — DOI, URL, license, timestamp in logs or API callbacks where feasible
  5. Subprocessors — who touches content; DPA terms
  6. High-risk restrictions — legal, medical, financial summarization requirements
  7. Abuse monitoring — whether violation review retains content up to two years (per ZDR docs)
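
One way to keep the seven requests above auditable is to record each vendor answer as structured data, so unanswered or merely informal answers surface before a risk assessment begins. The sketch below is illustrative only: the item keys, the DisclosureAnswer class, and the open_items helper are hypothetical names chosen for this example, not part of any vendor API or standard, and the sample answers are invented.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keys mirroring the seven disclosure items in the framework.
DISCLOSURE_ITEMS = [
    "source_classes",
    "fetch_behavior",
    "retention_and_training",
    "provenance_to_deployer",
    "subprocessors",
    "high_risk_restrictions",
    "abuse_monitoring",
]


@dataclass
class DisclosureAnswer:
    item: str
    answer: Optional[str] = None  # vendor's written response, if any
    contractual: bool = False     # True only if the answer is written into the contract


def open_items(answers: list[DisclosureAnswer]) -> list[str]:
    """Return items with no answer, or with an answer that is not contractual."""
    by_item = {a.item: a for a in answers}
    gaps = []
    for item in DISCLOSURE_ITEMS:
        a = by_item.get(item)
        if a is None or not a.answer or not a.contractual:
            gaps.append(item)
    return gaps


# Invented sample answers for illustration.
answers = [
    DisclosureAnswer("retention_and_training", "ZDR enabled; no training on API data", True),
    DisclosureAnswer("fetch_behavior", "See public docs", False),
]
print(open_items(answers))
```

Treating a non-contractual answer as still open is a deliberate design choice in this sketch: it reflects the point that inference from public pages is not the same as a commitment.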

Vendors cannot disclose everything, since trade secrets and scale matter, but deployers cannot complete data protection impact assessments (DPIAs) or copyright risk assessments when vendors stay silent.

What Users Should Demand in Contracts

Beyond vendor marketing pages, contracts should address:

  • Prohibition on ingesting subscription-licensed content unless the deployer warrants that it holds the necessary rights
  • Data processing addendum with subprocessors listed
  • Audit and logging rights for enterprise deployments
  • ZDR activation where confidential or regulated data may appear in scholarly PDFs
  • Incident notification for policy violations affecting stored content
  • Clear tier requirements — no consumer OAuth for production scholarly agents (legal/compliance authentication section)

Build-side provenance remains the deployer's responsibility even with vendor cooperation: Unpaywall metadata should flow into your logs regardless of model vendor.
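
As a minimal sketch of that flow, the function below turns an Unpaywall-style response into a log record carrying DOI, URL, license, and a retrieval timestamp. The field names (doi, oa_status, best_oa_location, url_for_pdf, license, host_type) follow Unpaywall's published data format as I understand it and should be checked against the current schema. The provenance_record name, the output layout, and the sample values are assumptions made for illustration, and no network call is made.

```python
import json
from datetime import datetime, timezone


def provenance_record(unpaywall: dict) -> dict:
    """Build a provenance log record from an Unpaywall-style response dict."""
    # best_oa_location can be missing or null for closed-access works.
    best = unpaywall.get("best_oa_location") or {}
    return {
        "doi": unpaywall.get("doi"),
        "oa_status": unpaywall.get("oa_status"),
        "url": best.get("url_for_pdf") or best.get("url"),
        "license": best.get("license"),  # may be None; record that fact rather than guess
        "host_type": best.get("host_type"),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


# Invented sample shaped like an Unpaywall response; not real data.
sample = {
    "doi": "10.1234/example.doi",
    "is_oa": True,
    "oa_status": "gold",
    "best_oa_location": {
        "url": "https://example.org/article",
        "url_for_pdf": "https://example.org/article.pdf",
        "license": "cc-by",
        "host_type": "publisher",
    },
}
record = provenance_record(sample)
print(json.dumps(record, indent=2))
```

Writing the record at fetch time, before any text reaches the model, keeps the log independent of whatever provenance the vendor does or does not return.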

Real-World Example

An enterprise procures Claude for regulatory affairs literature monitoring. The vendor confirms ZDR and no training on API data. Security asks whether built-in web research caches OA PDFs on Anthropic infrastructure during multi-step agent runs. Public docs do not answer conclusively, so the DPIA stalls. Procurement issues a supplemental questionnaire; the vendor's response determines whether OA cancer literature with demographic tables may be processed. Transparency is not academic: it gates deployment.

Limits of Disclosure

Trade secrets. Model architecture and some safety classifiers will remain confidential.

Scale. Pretraining corpora can run to trillions of tokens, so itemized disclosure is impractical; class-level disclosure may be the realistic floor.

Rapid product change. Features and retention policies evolve (consumer terms rollout); contracts need update mechanisms.

Bad actors. Disclosure does not stop intentional infringement; it enables lawful actors to comply.

Practical Checklist

  • Vendor questionnaire: retention, training, subprocessors, browsing/cache behavior
  • Contractual prohibition on unlicensed subscription content ingestion
  • Require provenance fields in retrieval pipelines you build (DOI, license, URL)
  • Map roles under EU AI Act (provider vs deployer) with counsel
  • Revisit disclosures when vendor updates Usage Policy or terms
  • Align internal logs with vendor ZDR/BAA configuration
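
For the "revisit disclosures" item, one low-effort trigger is to fingerprint the policy text you actually reviewed and compare it against the current text on a schedule. The sketch below assumes you already save the policy text yourself; the function names and the sample strings are hypothetical placeholders, not actual vendor terms.

```python
import hashlib


def fingerprint(policy_text: str) -> str:
    """SHA-256 of whitespace-normalized text, so reflowed lines do not raise alerts."""
    normalized = " ".join(policy_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_review(reviewed_text: str, current_text: str) -> bool:
    """True when the current text differs from the version that was reviewed."""
    return fingerprint(reviewed_text) != fingerprint(current_text)


# Placeholder strings for illustration only.
reviewed = "Inputs are retained for 30 days."
reflowed = "Inputs are retained\n  for 30 days."
changed = "Inputs are retained for 90 days."

print(needs_review(reviewed, reflowed))  # False: only whitespace changed
print(needs_review(reviewed, changed))   # True: the retention period changed
```

A hash only says that something changed, not what; a human still has to read the diff and decide whether the contract or the DPIA needs updating.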

Risks and Counterarguments

“Vendor certification is enough.” Certification covers the vendor-customer relationship, not publisher licenses.

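“We don’t need provenance if we only use OA.” OA licenses vary: bronze OA (free to read but without an explicit open license), non-commercial (NC) terms, and protected health information (PHI) in articles all still require records.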
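
A rough sketch of how license variation could be triaged from metadata alone follows. It assumes Unpaywall-style values (an oa_status such as "bronze" and license identifiers such as "cc-by-nc"); the review_flags name and the flag wording are hypothetical. PHI cannot be detected from license metadata, so that check is out of scope here and needs content-level review.

```python
from typing import Optional


def review_flags(oa_status: Optional[str], license_id: Optional[str]) -> list[str]:
    """Return reasons a retrieved OA item should get a human license check."""
    flags = []
    if oa_status == "bronze":
        flags.append("bronze: free to read, no explicit open license")
    if not license_id:
        flags.append("no license recorded")
    elif "nc" in license_id.lower().split("-"):
        # Matches identifiers such as cc-by-nc or cc-by-nc-nd.
        flags.append("non-commercial clause")
    return flags


print(review_flags("gold", "cc-by"))
print(review_flags("bronze", None))
print(review_flags("hybrid", "cc-by-nc-nd"))
```

An empty list means only that nothing was flagged by these three rules; it is not a clearance decision.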

“Transparency helps competitors.” Selective disclosure under NDA balances trade secrets and deployer duties.

Conclusion

Transparency is what makes a two-sided governance model workable: AI users and builders need enough detail to stay compliant; creators and publishers need enough visibility to know how their OA work is being retrieved and reused.




Relevant Sources

  1. Anthropic Transparency Hub — Anthropic — https://www.anthropic.com/transparency/system-trust-reporting
  2. Legal and compliance — Claude Code — Anthropic — https://docs.anthropic.com/en/docs/claude-code/legal-and-compliance
  3. Zero Data Retention — Anthropic — https://docs.anthropic.com/en/docs/build-with-claude/zero-data-retention
  4. Updates to Consumer Terms — Anthropic — https://www.anthropic.com/news/updates-to-our-consumer-terms
  5. Usage Policy Update — Anthropic — https://www.anthropic.com/news/usage-policy-update
  6. Regulation (EU) 2024/1689 — EUR-Lex — https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689
  7. Article 13 — AI Act — EU AI Act Service Desk — https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-13
  8. Copyright and AI hub — U.S. Copyright Office — https://copyright.gov/ai/
