Judy Hanwen Shen, Ken Liu, Angelina Wang, Sarah H. Cen, Andy K. Zhang, Caroline Meinhardt, Daniel Zhang, Kevin Klyman, Rishi Bommasani, Daniel E. Ho (Stanford University)
January 26, 2026
arXiv | PDF
This Stanford paper argues that current training data transparency policies are largely symbolic: they suffer from three fundamental gaps that prevent them from achieving their stated goals of protecting privacy, copyright, and data quality. The three gaps are: (1) a specification gap; (2) an enforcement gap; and (3) an impact gap. The paper offers a taxonomy of disclosure levels, maps each transparency objective to the disclosures it actually requires, and proposes technical research directions and policy fixes.
California AB 2013 is a state law (effective 2026) requiring developers of generative AI systems to publicly post “high-level summaries” of their training datasets on their websites, covering data sources, synthetic data usage, presence of personal information, copyrighted content, and dataset statistics. It was the first U.S. law specifically mandating AI training data transparency, but the paper argues it was weakened during the legislative process from detailed requirements to vague summaries.
EU AI Act (Regulation 2024/1689) is the European Union’s comprehensive AI regulation, which classifies AI systems by risk tier and imposes different transparency requirements for each. General-purpose AI model providers must disclose a data summary including data types and copyright status. High-risk systems (healthcare, criminal justice, employment) face stricter requirements under Article 10 for data governance practices. Unlike AB 2013, the EU AI Act assigns enforcement to the EU AI Office and imposes significant fines.
GDPR (General Data Protection Regulation) is the EU’s data privacy law (in effect since 2018) that requires organizations to inform individuals about the purposes of data collection (Article 13) and gives individuals rights over their personal data. It is relevant here because GDPR’s data processing requirements apply to AI training data that contains personal information, but the paper notes that GDPR’s individual-level protections don’t map cleanly to the scale of LLM training.
Membership Inference is a technical method for determining whether a specific data point was used in a model’s training set by analyzing the model’s behavior (e.g., confidence scores, loss values) on that data point. The paper identifies this as critical for copyright and privacy verification but notes it remains unreliable at scale. A model can memorize content without being able to reproduce it verbatim, and content overlap between sources makes attribution difficult.
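A minimal sketch of the loss-threshold flavor of membership inference: a record is flagged as a likely training member when the model assigns it unusually low loss. The “model” below is a toy unigram language model standing in for an LLM (the real attack uses the actual model’s per-token losses); all names here are illustrative, not from the paper.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Toy stand-in for a language model: unigram counts over the training corpus."""
    tokens = [t for doc in corpus for t in doc.split()]
    return Counter(tokens), len(tokens), len(set(tokens))

def avg_nll(model, text):
    """Average negative log-likelihood of `text` under the toy model
    (add-one smoothing so unseen tokens get nonzero probability)."""
    counts, total, vocab = model
    tokens = text.split()
    nll = sum(-math.log((counts.get(t, 0) + 1) / (total + vocab + 1))
              for t in tokens)
    return nll / max(len(tokens), 1)

def is_member(model, text, threshold):
    """Loss-threshold membership test: low loss suggests the text was trained on."""
    return avg_nll(model, text) < threshold

corpus = ["the cat sat on the mat", "the dog chased the cat"]
model = train_unigram(corpus)
print(avg_nll(model, "the cat sat on the mat"))        # lower: seen in training
print(avg_nll(model, "quantum flux capacitor array"))  # higher: all tokens unseen
```

Real attacks replace the fixed threshold with calibration against reference models or shadow models, but the scale problems the paper raises (memorization without verbatim reproduction, overlapping sources) apply regardless of calibration.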
Data Provenance is the documented history of a piece of data: where it came from, how it was collected, what licenses apply, and how it was transformed. The paper argues that tracking provenance through the AI data supply chain (from original creators through data vendors to model developers) is essential but rarely required or practiced.
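One way to picture what such tracking could look like: a chain of records, one per holder in the supply chain, each noting where the data came from and what was done to it. This is a hypothetical sketch, not a schema from the paper; every field name is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """One link in a data supply chain (illustrative schema, not from the paper)."""
    holder: str                 # e.g. original creator, data vendor, model developer
    obtained_from: str          # where this holder got the data
    license: str                # license terms attached at this stage
    transformations: list = field(default_factory=list)  # e.g. dedup, filtering

def licenses_along_chain(chain):
    """Collect the license attached at each stage, in supply-chain order,
    so a downstream auditor can spot where terms changed or were dropped."""
    return [rec.license for rec in chain]

chain = [
    ProvenanceRecord("creator", "original work", "CC-BY-4.0"),
    ProvenanceRecord("vendor", "creator", "CC-BY-4.0", ["deduplication"]),
    ProvenanceRecord("developer", "vendor", "CC-BY-4.0", ["quality filtering"]),
]
print(licenses_along_chain(chain))
```

Even this minimal structure makes the paper’s point concrete: if any link in the chain is missing, downstream claims about licensing or personal data become unverifiable.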
Foundation Model Transparency Index (FMTI) is a Stanford HAI project that scores major AI model developers on 100+ transparency indicators, including 10 data-related ones. It is useful for comparing company practices but doesn’t specify the intended impact of each disclosure.
N-gram Overlap is a method for detecting text similarity by comparing sequences of N consecutive words between two texts. The paper highlights a critical limitation: although courts have granted data owners permission to “inspect” training data using methods like substring search, research shows that LLMs can memorize and reproduce content without sharing any n-grams with the original, meaning n-gram-based membership tests can be trivially evaded.
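The evasion problem above can be demonstrated in a few lines: verbatim copying produces high n-gram overlap, while a simple paraphrase of the same content scores zero. This is a generic sketch of the technique, not code from the paper.

```python
def ngrams(text, n):
    """All n-grams (tuples of n consecutive tokens) in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also appear in the source."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(source, n)) / len(cand)

source = "the quick brown fox jumps over the lazy dog"
verbatim = "quick brown fox jumps over"
paraphrase = "a fast brown-colored fox leaps above a sleepy dog"

print(ngram_overlap(verbatim, source))    # 1.0: every 3-gram is in the source
print(ngram_overlap(paraphrase, source))  # 0.0: same content, no shared 3-grams
```

The paraphrase conveys the source’s content but leaves no n-gram trace, which is exactly why the paper argues substring-style inspection is an unreliable basis for copyright verification.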