Generative AI, Copyright Exceptions and the EU's AI Act
Questions answered and a look at things to come in the EU's AI Act
Last week, we talked about some of the high-level implications of the European Union’s text and data mining (TDM) copyright exceptions as laid out in the EU’s 2019 Directive on Copyright and the Digital Single Market (CDSM).
Since 2019, there has been ongoing discussion about how exactly those provisions will be implemented across member states and how they apply to generative AI, especially given how rapidly generative AI has developed in the intervening years. There have also been questions about how these rules would apply to those conducting TDM and model training outside of the EU.
With more recent versions of the AI Act available — presumed to be quite close to the act that will go into effect — we begin to see some answers to those questions.
The AI Act is an early landmark in the burgeoning landscape of AI legislation. While other jurisdictions will certainly take different paths, we see the EU’s legislation as particularly significant. It will have a tremendous effect on how AI companies conducting business in the English-speaking world will have to operate, and as an early piece of legislation on the issue, we think the AI Act will anchor the conversation and create a precedent that will influence other legislative bodies.
The act itself is a much broader piece of legislation than we will discuss here. If you’re interested in digging into the specifics of the act itself, this tool lets you navigate it more easily. Today, we’re going to focus on the implications around copyright and the direction the act sets for the future of AI data requirements. All direct quotations of the act presented here are consistent with the Trilogues draft submitted to committee on 2 February 2024.
How is the AI Act structured? Who does it apply to?
Broadly, the AI Act categorizes AI systems into unacceptable risk, high risk (under Annex II and Annex III), General Purpose AI (GPAI), and GPAI with systemic risk. Each of these categories carries different levels of restrictions, regulations, reporting requirements, and oversight.
An important implication here is that a model initially developed for one use may suddenly require a different level of oversight and documentation — even different training methodologies — if it is later applied to a different use case, for example being used for assessment and placement decisions in an educational setting.
Unacceptable risk applies to AI used for social scoring or manipulation. Such uses are banned.
High risk uses include AI used for law enforcement, educational assessment and tracking, healthcare and several other categories. These are generally uses of AI that have the power to shape people’s lives and have a risk of perpetuating or institutionalizing bias.
GPAI is defined as “an AI model, including when trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable to competently perform a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications. This does not cover AI models that are used before release on the market for research, development and prototyping activities.”
Systemic risk in GPAI is defined by training compute: a model falls into this category when the “cumulative amount of compute used for its training measured in floating point operations (FLOPs) is greater than 10^25.” This threshold is able to evolve, and a panel will evaluate exceptions in both directions.
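For a rough sense of where that threshold sits, a common approximation from the research literature (not anything defined in the act) estimates training compute for dense transformers at roughly 6 × parameters × training tokens. A back-of-the-envelope sketch, with invented figures:

```python
# Rough estimate of training compute against the AI Act's 10^25 FLOP threshold.
# Uses the common ~6 * parameters * tokens approximation for dense transformers;
# the model figures below are illustrative assumptions, not disclosed values.

SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25

def estimated_training_flops(parameters: float, training_tokens: float) -> float:
    """Approximate total training compute for a dense transformer."""
    return 6 * parameters * training_tokens

# Hypothetical model: 70 billion parameters trained on 2 trillion tokens.
flops = estimated_training_flops(parameters=70e9, training_tokens=2e12)
print(f"Estimated training compute: {flops:.2e} FLOPs")
print("Presumed systemic-risk GPAI:", flops > SYSTEMIC_RISK_THRESHOLD_FLOPS)
```

By that estimate, a 70-billion-parameter model trained on 2 trillion tokens lands around 8.4 × 10^23 FLOPs, well under the threshold, which is generally read as targeting only the very largest frontier models.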
Rules applying to GPAI are the first scheduled to go into effect. Since this is the first and broadest category, we’ll focus on the implications for those cases. High risk uses have more rigorous requirements in terms of training data, reporting, bias mitigation, etc.
Questions from the CDSM answered
The AI Act establishes that the CDSM TDM exceptions also apply to generative AI. While there’s an argument that the CDSM authors did not anticipate the rapid advancements of generative AI, the AI Act was drafted with a full understanding of the current generative AI landscape.
The language of the act calls out the TDM exceptions directly under Article 52(c), Obligations for Providers of General Purpose AI Models. Those developing general purpose AI models must “put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies, the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.”
This reference is to Article 4(3) of the CDSM, which requires TDM done outside of research organizations and cultural institutions to respect rights holders’ requests to be excluded from TDM (i.e., opt-out requests). Under the AI Act, this requirement is explicitly extended to general purpose AI model trainers, tasking them with identifying and respecting these reservations of rights. This requirement applies whether or not the model is open.
Recital 60i elaborates on this point:
Where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightholders if they want to carry out text and data mining over such works.
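What “identify and respect, including through state of the art technologies” will mean in practice is still taking shape. One emerging convention is the W3C TDM Reservation Protocol (TDMRep) draft, which can signal a reservation through a tdm-reservation response header. A minimal, purely illustrative sketch of a crawler-side check, assuming that convention:

```python
# Illustrative sketch: check a URL for a machine-readable rights reservation
# before adding it to a training crawl. Assumes the W3C TDMRep draft
# convention of a "tdm-reservation" response header ("1" meaning rights are
# reserved); real compliance would need to check several signals, not one.
import requests

def tdm_rights_reserved(url: str) -> bool:
    """Return True if the server signals a TDM rights reservation."""
    response = requests.head(url, allow_redirects=True, timeout=10)
    return response.headers.get("tdm-reservation", "0").strip() == "1"

candidate_urls = ["https://example.com/article.html"]  # hypothetical URLs
training_urls = [u for u in candidate_urls if not tdm_rights_reserved(u)]
print(training_urls)
```

In practice a provider would need to consult multiple signals (headers, site-level policy files, registries) and keep an auditable record of what was excluded, but the basic shape is the same: check for a reservation before ingesting.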
The AI Act also clarifies important questions of jurisdiction. Last week, we posed the hypothetical: If a server in the US scrapes data from China to create a chatbot that is then sold in Germany, do [the TDM exceptions] apply?
The AI Act signals that the answer is “Yes.” If an AI product is being sold in the EU, it has to abide by the EU’s requirements for AI.
Article 52ca outlines the responsibilities of an authorized representative, appointed by GPAI providers outside the EU who wish to introduce a GPAI into the EU market. This section of the AI Act clarifies that these providers outside the EU must maintain documentation, through their authorized representative, and adhere to the other obligations applied to GPAI outlined in the act.
Recital 60j clarifies this explicitly in context of copyright:
. . . providers of general purpose AI models should put in place a policy to respect Union law on copyright and related rights, in particular to identify and respect the reservations of rights expressed by rightholders pursuant to Article 4(3) of Directive (EU) 2019/790. Any provider placing a general purpose AI model on the EU market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these general purpose AI models take place.
With the proliferation of AI companies creating tools built around GPAI models, there will be a number of issues with bringing these products into EU markets. New models, trained on public domain and rights-respecting datasets, will need to be developed to support these requirements in the EU.
This all means that the AI Act will have global impact.
Public transparency of training data
Data transparency requirements are part of the EU’s strategy to support rights holders in expressing their reservation of rights, and they serve as a tool for ensuring compliance.
For GPAI providers, Article 52c(1a) requires technical documentation, the contents of which are specified in Annex IXa. It asks, among other things, for a general description of the model, specific details related to the model design, its training, and its energy consumption, as well as information on the training data used:
information on the data used for training, testing and validation, where applicable, including type and provenance of data and curation methodologies (e.g. cleaning, filtering etc), the number of data points, their scope and main characteristics; how the data was obtained and selected as well as all other measures to detect the unsuitability of data sources and methods to detect identifiable biases, where applicable
These requirements add overhead for AI researchers and developers. They also present some technical challenges. Maintaining a record of provenance, for one, is a broad and complicated ask. In April 2024, Spawning and the Open Future Foundation are convening a group of industry and policy leaders, including C2PA and the ISCC Foundation, to facilitate a discussion about how this point, among many, can be successfully executed in the real world.
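As a concrete illustration, documentation along these lines might be kept as a structured record per dataset. A minimal sketch follows; the field names and values are entirely hypothetical, since the act prescribes what must be documented, not a schema:

```python
# Minimal sketch of a per-dataset provenance record covering the kinds of
# fields Annex IXa asks about. Field names and values are hypothetical;
# the act prescribes the information to be documented, not a specific format.
dataset_record = {
    "name": "news-crawl-2023",                # hypothetical dataset
    "data_type": "text",
    "provenance": "public web crawl, January to June 2023",
    "how_obtained": "HTTP crawl limited to robots.txt-permitted pages",
    "curation": ["deduplication", "language filtering", "quality filtering"],
    "num_datapoints": 120_000_000,
    "scope": "news articles in EU languages",
    "rights_reservations": "URLs with machine-readable TDM opt-outs excluded",
    "bias_checks": ["source concentration report", "language balance report"],
}
```

Capturing records like this at ingestion time is far easier than reconstructing provenance after the fact, which is part of why the documentation requirement pushes developers toward curated, well-tracked datasets.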
A more specific request for data transparency can also be found in the recitals.
Previous drafts included language that asked for GPAI providers to “document and make publicly available a summary of the use of training data protected under copyright law,” as quoted by João Pedro Quintais of the Kluwer Copyright Blog. Quintais criticized this requirement back in May 2023, flagging the challenge of teasing out what, exactly, in the training data is protected under copyright law and raising the question of whether copyrighted materials should be singled out from other training data to fall under “special rules.”
Recital 60k provides the most insight into these transparency reporting requirements as they currently stand. It is deemed “adequate” if GPAI model providers “make publicly available a sufficiently detailed summary of the content used for training the general purpose model.” Already we see a key difference — a broad summary of all training data rather than a summary restricted to training data protected under copyright law.
What makes the summary “sufficiently detailed”? Well, it should:
be generally comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.
What exactly these reports will ultimately need to look like is something of a mystery. The current draft promises a template from the AI Office to help construct this summary and provide guidance for the narrative explanation. It also mentions that codes of practice will cover “the adequate level of detail for the summary about the content used for training.” These resources, however, are still to be developed.
Still, the language of the recital would seem to ask that model providers make public that their models were trained with, say, LAION-5B or through a bulk data purchase from a specific website. The intention laid out here is that these summaries should help rights holders knowledgeably take action to reserve their rights under Article 4(3).
The AI Act Signals Enforcement
More than simply offering guidelines, measures in the act — including stiff fines and the creation of regulatory bodies — suggest the EU is poised to enforce these new requirements through the AI Office.
Under the AI Act, National Supervisory Authorities (NSAs) are also to be designated by each member state. These authorities will be responsible for monitoring and enforcing compliance with the AI Act within their respective territories. They will also have powers to conduct investigations, inspect AI systems and request documentation. They are empowered to enforce compliance through various measures, including issuing warnings, orders and fines.
The fines themselves are steep, exceeding the maximum fines under the GDPR (General Data Protection Regulation) of 20 million EUR or 4 percent of global annual turnover. The fine caps range from the higher of 7.5 million EUR or 1 percent of total worldwide annual turnover (for supplying incorrect, incomplete, or misleading information) to the higher of 35 million EUR or 7 percent of total worldwide annual turnover (for violating the prohibitions on unacceptable-risk AI outlined in Article 5).
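A quick worked example makes the “higher of” structure clear (the turnover figure is invented):

```python
# Illustrative calculation of the fine caps described above. Each cap is the
# higher of a fixed amount and a percentage of total worldwide annual
# turnover; the company turnover used here is made up.

def fine_cap(turnover_eur: float, fixed_eur: float, pct: float) -> float:
    """Return the maximum fine: the higher of a fixed sum or a % of turnover."""
    return max(fixed_eur, pct * turnover_eur)

turnover = 2_000_000_000  # hypothetical 2 billion EUR annual turnover

# Supplying incorrect, incomplete, or misleading information:
print(fine_cap(turnover, fixed_eur=7_500_000, pct=0.01))   # 20,000,000 EUR

# Violating the Article 5 prohibitions on unacceptable-risk AI:
print(fine_cap(turnover, fixed_eur=35_000_000, pct=0.07))  # 140,000,000 EUR
```

For large providers, the percentage prong dominates, which is what gives these caps their teeth.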
The AI Act additionally outlines the creation of a European Artificial Intelligence Board. The board is intended to facilitate cooperation and coordination among the NSAs and the European Commission. It is tasked with providing guidance, sharing best practices and ensuring a consistent application of the AI Act across the EU. Specifically, the board will have a role with regard to “the enforcement of rules on [GPAI] models.”
Ongoing Response
It seems clear that model developers and providers will have to respond to the AI Act. For many companies, that response may get complicated from a technical and business perspective. While there will always be companies who view penalties as the cost of doing business, we’ve seen from the GDPR how impactful this type of legislation can be.
As more legal experts have an opportunity to comment on the AI Act and we move closer to its provisions coming into force, we expect the conversation to turn from whether generative AI developers have a requirement to respect opt-outs to how data owners’ reservations of rights can be accommodated within the spirit of the law.
The AI Act clearly intends to make it feasible for rights holders to express these rights and for developers to identify and respect them. What remains to be determined is what system is tenable for both sides of this equation. This is part of the conversation we’ve been having at Spawning since our inception. We believe that data-use reservations need to be easy to register and easy to respect. Rights holders and AI developers require a simple, universal system to register and read these reservations of rights, which is the intent behind Spawning’s Do Not Train Registry and Data Diligence package.
While specific implementations of the AI Act will surely evolve as codes of practice are developed, the EU has clearly signaled that the days of unrestricted data scraping on the open web are numbered for anyone looking to participate in the EU market.