In the EU, Opt-outs Are the Way Forward
What the EU's TDM copyright exceptions mean for researchers, developers and rights holders
This past weekend, Stability AI announced the release of their newest text-to-image model, Stable Diffusion 3. This model and Stable Cascade both removed images that were opted out in Spawning’s Do Not Train Registry from their training datasets. It’s a huge win for rights holders who did not want their works used for AI training, and we’re proud to have been able to work with both rights holders and Stability to make it happen.
While respecting opt-outs currently puts our partners, such as Stability AI and Hugging Face, ahead of the curve in the world of generative AI, we think other big model trainers will follow in their footsteps.
As of January 2024, the European Union’s (EU) latest draft of the AI Act doubles down on the EU’s earlier text and data mining (TDM) copyright exceptions and calls out their application to general purpose AI. This inclusion in the AI Act has significant implications for the role that opt-outs will play in future AI training.
We plan to talk more about the AI Act in our next post, but before we do, we want to start with some background on the EU’s TDM copyright exceptions. The rules governing the use of data remain fairly opaque, and the EU provides a rare spot of relative legislative clarity. So, what exactly are the EU’s TDM copyright exceptions, and what do they mean for researchers, developers and rights holders?
What is text and data mining?
TDM is a broad category of activity, which the EU’s Directive on Copyright in the Digital Single Market (CDSM) describes as the “automated computational analysis of information in digital form.” The breadth of that definition has raised questions about whether the TDM exceptions should also apply to generative AI.
Does TDM count as fair use, or does copyright still apply?
Jurisdictions worldwide are currently facing the question of how existing copyright law applies to data use at the scale of generative AI: Do texts, images and other media become merely “facts” when ingested into a model’s training data, or does training on copyrighted works require attribution and compensation? Governments are also weighing how to support economic growth and scientific innovation through AI while protecting rights holders.
Many jurisdictions still haven’t formally addressed these questions. The US, for example, has yet to clearly legislate on the issue, nor has a clear legal precedent been set through the courts. In the absence of clear guidance, commercial model trainers are largely operating under the assumption that using copyrighted works to train generative AI models is sufficiently transformative to fall under fair use. US legal scholars, however, are not all in agreement, and some of the largest AI companies, including Stability AI, Microsoft, OpenAI, Anthropic and Meta, are currently facing legal challenges over their use of copyrighted materials.
The EU’s Directive on Copyright in the Digital Single Market
The EU, on the other hand, has provided more up-to-date legal guidance on these issues within its jurisdiction. In June 2019, the CDSM Directive came into effect, updating the EU’s copyright rules to reflect the changing digital landscape. Two of its articles, Article 3 and Article 4, explicitly address TDM copyright exceptions.
For researchers and AI model trainers, Articles 3 and 4 give guidance on when you can and cannot perform and use TDM with copyrighted materials. For rights holders, these articles give guidance on when copyrighted works can be legally used for TDM purposes, and what steps you can take to prevent TDM of your works.
What Are the EU’s TDM Copyright Exceptions?
Article 3 and Article 4 of the CDSM delineate the legal exceptions to regular copyright protections for TDM purposes in the EU (that is, when and how TDM can be done with copyrighted works). These essentially state:
Article 3: TDM is permitted on any lawfully accessible copyrighted works when it is done by academic research organizations and cultural heritage institutions for research purposes.
Article 4: In all other cases (including commercial applications), TDM is also permitted on any lawfully accessible copyrighted works, unless the rightsholder has opted out their works “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.”
These are “mandatory” exceptions. “Mandatory,” in this case, applies to EU member states, meaning these or similar provisions must eventually be enacted by all EU member states and by future EU member states. So, these rules are intended to apply across the EU.
However, there is some leeway in how each member state chooses to meet these provisions. Member states were allowed to, and many have chosen to, expand exceptions for researchers, allowing copyrighted materials to be shared and used to communicate with the public, not just duplicated. Member states have also settled on differing opt-out requirements, with some making the “machine-readable” exemplar a requirement for opting out. The lack of a clearly defined opt-out method makes matters of consent and compliance more complicated, which is why Spawning’s Do Not Train Registry consolidates a variety of machine-readable opt-out methods into a single, comprehensive system for rights holders and developers.
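To make “machine-readable” concrete, here is a minimal sketch that treats a robots.txt rule as one possible opt-out signal, using Python’s standard urllib.robotparser module. This is an illustration under stated assumptions, not a statement of what any member state accepts as a valid opt-out, and it is not how Spawning’s registry works: the crawler name ExampleTDMBot, the is_opted_out helper and the sample robots.txt are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one (made-up) TDM crawler
# and allows every other user agent.
EXAMPLE_ROBOTS = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_opted_out(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules disallow user_agent from fetching url.

    Assumption: a Disallow rule is read here as an opt-out signal.
    Whether that is legally sufficient is a separate question.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/gallery/image-1.png"
    print(is_opted_out(EXAMPLE_ROBOTS, "ExampleTDMBot", url))
    print(is_opted_out(EXAMPLE_ROBOTS, "SomeOtherBot", url))
```

A single signal like this covers only one opt-out method, which is the gap a consolidated registry is meant to close.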
The bottom line on the EU’s TDM exceptions for researchers and model trainers:
Academic research organizations and cultural heritage institutions can engage in TDM on all lawfully accessible copyrighted works for research purposes.
Everyone else, including commercial developers, can also use copyrighted works for TDM, but they cannot include those that have been appropriately opted out by the rights holder.
The Spawning API and Data Diligence package consolidate various machine-readable opt-out methods (not just those developed by Spawning), putting developers in compliance with the TDM provisions across the EU.
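The first two points above can be read as a simple decision rule. The sketch below is a deliberately simplified illustration of that rule, not legal advice and not Spawning’s implementation: the Work class, its fields and both function names are hypothetical, and it assumes that lawful access and opt-out status have already been determined for each work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    url: str
    lawfully_accessible: bool
    opted_out: bool  # rights holder opted out in an "appropriate manner"


def tdm_permitted(work: Work, research_exception: bool) -> bool:
    """Simplified reading of Articles 3 and 4.

    research_exception is True when a research organization or cultural
    heritage institution is mining for research purposes (Article 3).
    """
    if not work.lawfully_accessible:
        return False  # both articles require lawful access
    if research_exception:
        return True  # Article 3: opt-outs do not block research TDM
    return not work.opted_out  # Article 4: honor the opt-out


def filter_training_set(works, research_exception=False):
    """Keep only the works on which TDM is permitted under the rule above."""
    return [w for w in works if tdm_permitted(w, research_exception)]


if __name__ == "__main__":
    works = [
        Work("https://example.com/a.png", True, False),
        Work("https://example.com/b.png", True, True),
        Work("https://example.com/c.png", False, False),
    ]
    print([w.url for w in filter_training_set(works)])
    print([w.url for w in filter_training_set(works, research_exception=True)])
```

In this sketch a commercial trainer keeps only the first work, while a qualifying research use also keeps the opted-out one. National implementations differ, so a real pipeline would need jurisdiction-specific checks.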
The bottom line on the EU’s TDM exceptions for rights holders:
Rights holders cannot prevent their copyrighted works from being used for TDM by research groups and cultural institutions for research purposes, when accessed legally.
Rights holders can take measures to “ensure that only persons having lawful access to their data can access them,” but these measures should not “undermine” the exception for researchers (Recital 16 of the Directive).
Rights holders can prevent their works from being used for TDM in all other cases, if they opt out their works in an “appropriate manner.”
Spawning’s Do Not Train Registry meets the opt-out requirements of all EU member states by consolidating multiple opt-out methods into a single machine-readable system. Rights holders opt out once, but gain protection through a variety of procedures that developers can use worldwide.
What questions still remain?
These articles of the CDSM provide some straightforward guidance for researchers, commercial AI developers and rights holders. The EU has aligned on an opt-out system as a way of honoring rights holders’ wishes for their intellectual property, and commercial developers are expected to respect those wishes. However, the Directive is still subject to judicial interpretation, and opinions differ on some points.
Some have questioned whether the rules for TDM exceptions should apply equally to generative AI model training, and questions of jurisdiction continue to crop up: If a server in the US scrapes data from China to create a chatbot that is then sold in Germany, do these rules apply?
These questions are now being addressed in the EU’s forthcoming AI Act, which builds on the CDSM TDM exceptions. Join us for our next post, where we’ll break down the key takeaways of the AI Act as they affect copyright for developers and rights holders.