EU Publishes Mandatory Template for Disclosing AI Training Data

July 24, 2025

The European Commission on Thursday filled in one of the biggest blank spaces in the AI Act with the release of the mandatory template that providers of general-purpose AI (GPAI) models must use to disclose the data used in model training. Providers of any GPAI model made available in the EU after August 2, 2025, must publish a “sufficiently detailed summary” of the model’s training data, whether it was sourced from publicly available datasets or privately licensed archives, or scraped from the internet. Providers of models already in use in the EU before that date have until August 2, 2027, to provide a retroactive summary.

“Today’s template adopted by the Commission is another important step towards trustworthy and transparent AI,” Commission technology chief Henna Virkkunen said in a press release accompanying the template’s release. “By providing an easy-to-use document, we are supporting providers of general-purpose AI models to comply with the AI Act.”

The purpose of the disclosure, according to an explanatory notice attached to the template, “is to increase transparency on the content used for the training of general-purpose AI models, including text and data protected by law and to facilitate parties with legitimate interests, including rightsholders, to exercise and enforce their rights under Union law.”

Among those with legitimate interests are holders of copyrights and other intellectual property rights. According to the notice, “This information is needed to facilitate the exercise of their fundamental right to intellectual property and the fundamental right to an effective remedy in the enforcement of their rights, as provided for in Union law.”

Other legitimate interests include those of data subjects, support for the enforcement of Union data protection rules, and “the fundamental right to receive and impart information and allow researchers to exercise their freedom of science to conduct scientific research.”

Read more: AI Action Plan Aims to Promote U.S. Leadership, Eliminate Regulations At All Levels

The template is divided into three sections: general information, data sources, and data processing.

Under the first, providers must identify the model and versions being released, the modalities of the data used in training (i.e. text, image, audio, video, other), and the size of the datasets used.

The second section, on data sources, forms the heart of the required disclosures. It calls for a list of the publicly available datasets compiled by third parties that were used in training, such as collections available on public repositories, online platforms, or specialized websites, as well as a list of all private, non-publicly available datasets used, whether licensed directly from the dataset’s source or obtained through data intermediaries.

Providers must also disclose a summary of data scraped from online sources, whether scraped by the model provider or a third party, including a list of the top 10% of domain names scraped, ranked by the amount of data collected (top 5% for SMEs), along with any use of user data from services or products controlled by the model provider.
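To illustrate how a provider might assemble that part of the summary, the Python sketch below aggregates a hypothetical crawl log by domain and keeps the top 10% of domain names ranked by data volume (5% for SMEs). The function name, the log format, and the threshold logic are illustrative assumptions, not part of the Commission’s template.

from collections import defaultdict

def top_scraped_domains(crawl_records, fraction=0.10):
    """Rank scraped domains by data volume and keep the top fraction.

    crawl_records is assumed to be an iterable of (domain, bytes_collected)
    pairs; fraction would be 0.10 for most providers and 0.05 for SMEs.
    """
    volume_by_domain = defaultdict(int)
    for domain, size in crawl_records:
        volume_by_domain[domain] += size

    # Sort domains from largest to smallest share of collected data.
    ranked = sorted(volume_by_domain.items(), key=lambda item: item[1], reverse=True)

    # Keep the top fraction of domain names (at least one).
    cutoff = max(1, round(len(ranked) * fraction))
    return [domain for domain, _ in ranked[:cutoff]]

# Example with made-up crawl log entries.
crawl_log = [
    ("example.org", 1_200_000),
    ("news.example.com", 800_000),
    ("blog.example.net", 50_000),
]
print(top_scraped_domains(crawl_log))  # -> ['example.org'] for this tiny log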

For domains not among the top 10%, the explanatory notice recommends providers “act in good faith and on a voluntary basis” to enable rightsholders “upon request” to obtain information on whether their data was scraped.

The final section, on data processing, covers the methods used to identify and comply with rights reservations (i.e. opt-outs) permitted under the text-and-data-mining exception, the opt-out protocols honored, and the methods used to remove illegal content, such as child sexual abuse material and terrorist content.
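Taken together, a provider’s internal record-keeping for the summary might be organized along the lines of the sketch below, which mirrors the template’s three sections. The class and field names are illustrative assumptions for internal bookkeeping, not the Commission’s official schema or file format.

from dataclasses import dataclass
from typing import List

@dataclass
class GeneralInformation:
    # Section 1: model identity, training-data modalities, dataset size.
    provider: str
    model_name: str
    versions: List[str]
    modalities: List[str]           # e.g., ["text", "image", "audio", "video"]
    dataset_size_summary: str       # overall size of the data used per modality

@dataclass
class DataSources:
    # Section 2: where the training data came from.
    public_datasets: List[str]      # third-party collections on public repositories, platforms, or specialized sites
    private_datasets: List[str]     # licensed directly or obtained via data intermediaries
    top_scraped_domains: List[str]  # top 10% of domains by data collected (5% for SMEs)
    user_data_sources: List[str]    # provider-controlled services or products contributing user data

@dataclass
class DataProcessing:
    # Section 3: rights reservations and content removal.
    tdm_opt_out_methods: List[str]       # how opt-outs under the text-and-data-mining exception are identified and honored
    opt_out_protocols: List[str]         # reservation protocols respected during collection
    illegal_content_measures: List[str]  # methods for removing illegal content

@dataclass
class TrainingDataSummary:
    general: GeneralInformation
    sources: DataSources
    processing: DataProcessing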