A significant concern raised by the diffusion of artificial intelligence models is the protection of copyrighted material. The issue relates particularly to general-purpose AI (GPAI) models, which use deep learning techniques to perform a wide range of distinct tasks. Training GPAI models requires very large amounts of data, which in practice often come from publicly available sources and the web. Such data is collected using web crawlers, programmes that navigate the web to perform a defined set of functions.
In the EU, the Information Society Directive (Directive 2001/29) and the EU Copyright Directive (Directive 2019/790) protect the economic and moral rights of authors. The two directives also provide for some exceptions. In particular, Article 5(1) of the Information Society Directive exempts temporary acts of reproduction that form part of a technological process from copyright protection. Additionally, Articles 3 and 4 of the EU Copyright Directive create two exceptions for ‘text and data mining’ (TDM), defined as an automated analytical technique used to generate information such as patterns and trends. The new AI Act includes two provisions relating to copyright: the first (Article 53(1)(c)) obliges GPAI providers to comply with existing copyright law, while the second (Article 53(1)(d)) lays down transparency requirements, stating that GPAI providers must make publicly available a summary of the content used for training.
Nevertheless, researchers are concerned about the potential presence of copyrighted material in GPAI training datasets, as they consider the existing legal safeguards insufficient to prevent it. Scholars highlight the absence of a TDM exception to the right of communication to the public, a right that could be invoked once GPAI models produce output containing copyright-protected works. Additionally, the existing TDM exceptions are criticised for a lack of legal certainty in relation to training GPAI models with copyright-protected material. The first TDM exception contained in the EU Copyright Directive (Article 3) applies to research organisations and cultural heritage institutions, for the purposes of scientific research. The uncertainties about extending this exception to GPAI training data concern its technical applicability and the ambiguity of the “lawful access” condition contained in the provision. The second TDM exception (Article 4), the so-called opt-out exception, permits the reproduction and extraction of copyrighted works as long as the rights holder has not expressly reserved that use. However, it is unclear what kind of ‘machine readable’ language rights holders should use to opt out, and for how long the exception permits reproductions of works to be kept.
In 2023, several Member States set up a Copyright Infrastructure Task Force. The majority of the Member States, while finding that the TDM exceptions do not include in their scope the use of copyrighted works for AI training, consider that specific legislation on the topic is not necessary. In October 2024, Commissioner Virkkunen suggested licensing agreements between AI companies and the creative industries. Finally, in view of the revision of the EU Copyright Directive, set for June 2026, reference should be made to the AI Act’s GPAI Code of Practice.
