AI Developers Avoid Details in Initial Training Data Disclosures Under California Statute

January 22, 2026

California’s Training Data Transparency Act (TDTA) has moved from theory to practice, and the first public disclosures filed by major AI developers are beginning to clarify how the statute is likely to be interpreted in the market. Early filings from OpenAI and Anthropic suggest that compliance will center on broad, generalized descriptions of training data, closely tracking the statute’s “high-level summary” language, while stopping well short of revealing dataset-specific details that companies view as competitively sensitive.

The law, enacted as AB 2013 and effective January 1, 2026, requires developers of generative AI systems made available to the public to post a public summary of the datasets used to train those systems. The statute lists 12 categories of information that must be addressed, ranging from the sources or owners of training data to whether the data includes copyrighted material, personal information, aggregate consumer information, or synthetic data. It also requires disclosures about data cleaning and processing, time periods of collection, and when datasets were first used in development.

Notably, however, the statute does not define how much detail is required to satisfy its "high-level summary" standard, nor does it provide safe harbors or regulatory guidance distinguishing adequate disclosure from impermissible disclosure of trade secrets.

That ambiguity has long been a central concern for AI developers. According to an analysis by Goodwin Law, companies view the selection, composition, and treatment of training data as core intellectual capital. Overly granular disclosures could, in their view, enable competitors to infer training strategies or replicate model development approaches. Against that backdrop, the first disclosures filed by OpenAI and Anthropic are being closely watched as informal guideposts for the rest of the industry.

Both companies' filings explicitly reference the statutory provision and are structured to touch each of the 12 required categories. OpenAI's disclosure takes the form of a concise narrative summary, while Anthropic adopts a more formal, enumerated structure that mirrors the statute and adds limited explanatory context. Despite these stylistic differences, the substantive approach is largely the same. Neither company identifies specific datasets, data repositories, or named sources. Instead, each relies on generalized categories such as publicly available information, nonpublic data obtained from third-party partners, user-provided data subject to opt-out mechanisms, human-generated training data, and synthetic data.

On intellectual property, both companies state only that their training data may include material protected by copyright alongside public-domain content. Anthropic adds that publicly available data is obtained through a general-purpose web crawler and characterizes this approach as consistent with standard industry practice, while OpenAI does not describe its collection methods in similar detail. The disclosures do not attempt to verify the licensing status of specific content, reflecting both practical limitations and the legal uncertainty surrounding large-scale data scraping.

The same pattern appears in disclosures related to personal and aggregate consumer information. Both developers acknowledge that their training data may include such information as defined under California law. OpenAI states that it takes steps to reduce the amount of personal or aggregate consumer data in its training datasets. Anthropic notes that personal information may be incidentally present in internet-sourced data and is not used to identify or target individuals.

Taken together, these initial filings suggest that other developers are likely to adopt a similar approach to compliance. The statute's express allowance for estimates, ranges, and generalized descriptions gives companies substantial flexibility, and the absence of enforcement guidance reinforces incentives to err on the side of minimal disclosure. Goodwin suggests that some developers may also adopt a wait-and-see approach in light of pending litigation challenging the law's constitutionality on trade secret grounds.

Absent further guidance or enforcement actions clarifying expectations, the first wave of filings is likely to set a de facto standard for compliance, one that emphasizes procedural adherence to the statute's categories rather than substantive visibility into training data practices.