One Size Does Not Fit All: Machine-Readable Rights Reservations
A look at the benefits and limitations of key machine-readable methods for registering rights reservations
We’ve built several tools to provide coverage of the different use cases rights holders have when looking to express rights reservations. We recommend a 3-step process to add content from across the web to the Do Not Train Registry (DNTR). But there are other machine-readable methods out there for expressing rights reservations. The EU’s 2019 directive on Copyright in the Digital Single Market (CDSM) Article 4(3) requires that all of these methods be respected by commercial model trainers who participate in the EU market.
Key Machine-Readable Opt-Out Methods
Here we’ll be taking a look at several machine-readable methods for expressing rights reservations, who they work for, and how they compare. Each of these methods is incorporated into the Data Diligence package, which improves adherence to each method and makes it easier for model trainers to stay in compliance with the forthcoming AI Act.
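For model trainers, the goal is for respecting all of these methods to take just a few lines of code. Here’s a hypothetical sketch of what that could look like with the open-source datadiligence package; the function names and options shown are assumptions, so check the package’s own documentation for the actual API.

```python
# Hypothetical sketch of filtering scraped URLs with the Data Diligence package
# before training. Function names are assumptions; consult the package docs.
import datadiligence as dd

candidate_urls = [
    "https://example.com/images/cat.jpg",
    "https://example.org/posts/essay.html",
]

# Drop any URLs whose rights holders have expressed a reservation
# (DNTR registration, ai.txt, HTTP headers, and so on).
allowed_urls = dd.filter_allowed(urls=candidate_urls)
print(f"{len(allowed_urls)} of {len(candidate_urls)} URLs cleared for training")
```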
While you may find value in any of these methods, some provide better protection than others, some are easier to implement, and still others cover important edge cases. You can think of these various methods as slices of Swiss cheese — they all have holes, some bigger than others, but stack them up and you’ll cover most of your sandwich. (Thanks to Iris Luckhaus, the fantastic artist behind our team portraits, for the analogy.)
Ai.txt
Ai.txt is a standard proposed by Spawning to address the unique technical considerations of web scraping for AI training. It’s an alternative to registering a domain with the Do Not Train Registry. Adding ai.txt requires a little more work (we’ve created tutorials for deploying it to WordPress, Squarespace, Shopify, and other large website platforms to make the process easier), but it has an advantage for owners of large networks of domains who want more granular control over their settings. It also exists as a standard that any model trainer can respect, even if they don’t use our Data Diligence package.
Ai.txt is also designed to be checked prior to downloading media files, unlike many other methods on this list, which can only be checked after the media has already been downloaded. We consider this an important feature: otherwise the media host has to pay network fees every time their content is downloaded by a scraper, even if the scraper throws the data out later. For large hosting sites, this can add up to thousands of dollars in server costs each month. Checking permissions first saves compute and money.
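Because ai.txt follows the robots.txt format, the pre-download check on the scraper’s side can stay simple. Here’s a minimal sketch, assuming robots.txt-style Disallow directives with file-extension wildcards; see Spawning’s ai.txt documentation and generator for the exact syntax.

```python
# Minimal sketch of an ai.txt pre-download check on the scraper side.
# The directive handling below is illustrative; Spawning's ai.txt spec and
# generator define the real syntax.
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_ai_txt(domain: str) -> str:
    """Fetch https://<domain>/ai.txt, returning '' if it isn't there."""
    try:
        with urlopen(f"https://{domain}/ai.txt", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def is_disallowed(ai_txt: str, media_url: str) -> bool:
    """Rough check: treat 'Disallow: /' and 'Disallow: *.<ext>' as blocking."""
    path = urlparse(media_url).path.lower()
    for line in ai_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line.lower().startswith("disallow:"):
            continue
        pattern = line.split(":", 1)[1].strip().lower()
        if pattern == "/" or (pattern.startswith("*.") and path.endswith(pattern[1:])):
            return True
    return False


media_url = "https://example.com/gallery/painting.jpg"
rules = fetch_ai_txt(urlparse(media_url).netloc)
if is_disallowed(rules, media_url):
    print("Skipping download: rights reservation found in ai.txt")
```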
Pros:
Adopts the robots.txt standard, so it’s easy to respect.
Protects media files and HTML files, which includes most website text.
Provides protection when other sites include links to your media files.
Add once and it applies to your entire site.
Provides granular control over different media types and content locations.
Saves on network fees. Designed to be checked prior to download rather than after.
Make a universal no-AI declaration without impacting SEO.
Cons:
Only works on your website, not media hosted elsewhere.
Somewhat technical, but similar to robots.txt, which website owners regularly use.
C2PA Manifests
C2PA manifests are metadata attached to individual media files to help record provenance and authenticity. They can also be used to include information about data use and consent. After the recent release of a language-compatible wrapper, these are on our roadmap to add to the Data Diligence package.
C2PA manifests have the benefit of being added automatically when you use a C2PA-enabled camera, or in Adobe tools when the setting is enabled. Adjusting the settings to include consent data rather than just provenance data seems a bit trickier. If you’re already creating works in Adobe, these make sense, but there are alternatives if that’s not part of your workflow.
The manifest also attaches to the individual media file, with the purpose of following that file wherever it ends up on the web. One serious limitation, however, is that some social media sites strip metadata from media files when they are posted. We were able to confirm that when a piece of media is posted to X (formerly Twitter) or Pinterest, the C2PA manifest is removed, which leaves that media vulnerable to being scraped or reposted without the manifest.
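To give a sense of what that consent data looks like, here’s a rough sketch of the “training and data mining” assertion a C2PA manifest can carry, written out as a Python dict. The labels and field names reflect our reading of the C2PA specification and may not match the current version exactly, so treat it as illustrative.

```python
# Illustrative sketch of a C2PA "training and data mining" assertion, expressed
# as a Python dict. Labels and field names follow our reading of the C2PA spec
# and may differ from the current version.
do_not_train_assertion = {
    "label": "c2pa.training-mining",
    "data": {
        "entries": {
            "c2pa.ai_training": {"use": "notAllowed"},
            "c2pa.ai_generative_training": {"use": "notAllowed"},
            "c2pa.data_mining": {"use": "notAllowed"},
            "c2pa.ai_inference": {"use": "allowed"},
        }
    },
}
# In practice this assertion is embedded in a signed manifest by a C2PA-aware
# tool rather than written by hand; the signature is what makes it tamper-evident.
```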
Pros:
Attaches to your media files themselves, so it can follow them, in some cases, even when copied.
Easy to add if you are already working in Adobe or with a C2PA-enabled camera.
Cons:
Has to be added to each media file individually at the time of creation, or the media has to be re-uploaded. This makes it more useful for new works than those already on the web.
Network fees. Manifests are checked after the file is already downloaded.
Isn’t designed for HTML files, which includes most website text.
Removed by some social media sites, including X and Pinterest, presenting a significant vulnerability when work is shared.
Domain Registration
Domain registration is an easy way for anyone to add the contents of their website to the Do Not Train Registry. To do so, simply enter your domain at Have I Been Trained and select “Register Domain.” You’ll need to send an email from an official email address to verify that you own the website.
When you’ve registered your domain, all media files on your website will be registered with the Do Not Train Registry. Any additional content published on your website will automatically be added as well — you don’t need to search for it or register it separately.
We recommend domain registration for website owners who want to register rights reservations for all of the media on their site. It’s a great option for artists, musicians, and videographers with portfolio sites.
Pros:
Easy to use, non-technical option.
Only have to do it once. It applies to all of the work hosted on your domain, and it doesn’t need to be updated when new content is added.
Protects both media files and HTML files, which includes most website text, at the time of download.
Saves on network fees. Designed to be checked prior to download rather than after.
Cons:
Doesn’t protect content hosted elsewhere on the web, even when embedded on your site, e.g., YouTube videos, SoundCloud audio, or image CDNs.
Doesn’t have the level of granular control of HTTP headers or ai.txt.
Have I Been Trained? Search
Have I Been Trained? allows anyone to search the images included in the LAION-5B dataset. When you locate images that belong to you, you can select as many as you need and click to add them to the Do Not Train Registry. It’s a relatively quick way to opt out multiple images at once, and it doesn’t require any comfort with installing files at the root of a website. It’s also one of the few ways to express rights reservations on your images when they are already out there on the web and hosted by someone else. If you can’t edit the image’s metadata or install a file on the hosting site, you can still opt out with HIBT.
However, it is limited in scope to images (as opposed to all forms of media files and website text) and it only includes images that have already been added to the LAION-5B dataset. It’s not a comprehensive tool, but it pairs well with some of the other methods listed here to help you find all your images. We recommend using it after you’ve already registered any domains you own.
Pros:
Easy to use, non-technical option.
Helps you find work that has been posted around the web by searching the largest, most popular AI training dataset.
Protects works that aren’t hosted on your site.
Multi-select speeds up the process of adding individual works.
Cons:
Limited to images that are already in the LAION-5B dataset.
Doesn’t cover other types of media or newly created content.
HTTP Headers
HTTP headers can do some cool things, but they are not an easy or user-friendly option to implement on your own. They enable a granular level of control, allowing you to apply declarations to specific items and not to others. They are also fairly easy for developers to check, which works in their favor. But implementation is a bear, and even code-savvy people run into problems trying to get these to work correctly.
HTTP headers make the most sense for large websites that host other people’s content (and may need different permissions for different pieces of media) and that have a dev team in place to do the work. This is the case with DeviantArt, which sets these headers by default.
The average person looking to create and share creative works online will probably not want to go down this rabbit hole. To implement these on your site, you basically need to modify your host server’s configuration, which is a technical process that many hosting platforms don’t allow at all. If you want to express rights reservations for everything on your site, domain registration or an ai.txt file is going to be a lot more straightforward.
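To make the mechanics concrete, here’s a minimal sketch of a scraper-side check. It assumes the noai and noimageai directives that DeviantArt popularized in the X-Robots-Tag response header; the exact directive names a given scraper honors may vary.

```python
# Minimal sketch of a scraper-side header check. The "noai"/"noimageai"
# directives in X-Robots-Tag are the convention DeviantArt popularized;
# scrapers may look for different directive names.
from urllib.request import urlopen

url = "https://example.com/art/illustration.png"
with urlopen(url, timeout=10) as resp:
    directives = resp.headers.get("X-Robots-Tag", "").lower()
    body = resp.read()  # the content has already been transferred at this point

if "noai" in directives or "noimageai" in directives:
    body = None  # discard: the rights holder has reserved AI training rights
    print("Rights reservation found in X-Robots-Tag; discarding download")
```

Note that the response body has already crossed the wire by the time the header is inspected, which is why headers show up under “network fees” in the cons below.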
Pros:
Img2dataset, the most commonly used tool to scrape datasets, respects these by default (but users can change that setting).
Granular control over which items on your site you apply it to.
Protects both media files and HTML files, which includes most website text.
Can be applied globally, so that protection automatically applies when new content is added.
Cons:
Very complicated to add, even for technically savvy people.
Network fees. Headers are checked after the content is already downloaded.
Can only be added to sites you control, not content hosted by someone else.
Browser Extension
Our browser extension probably shouldn’t be your first step for expressing rights reservations (we actually recommend it as step 3), but it fills a gap left by other machine-readable opt-out methods. The browser extension allows you to add media files to the Do Not Train Registry even when you don’t host the files.
Many of the most convenient one-stop tools in this list require you to have control of the hosting domain. That includes ai.txt, robots.txt, domain registration, and HTTP headers. C2PA manifests attach to the media file itself, but they have to be added at the time the media file is created.
If you’re looking to add a do-not-train notice to a piece of media that is already published on the web, that you don’t host, and that isn’t in the LAION-5B dataset, the browser extension is the only real option. We recommend the browser extension to “clean up” any stray works that are scattered across the web after you’ve already opted out the works that you host. It can also be a useful tool for tackling works posted on social media sites.
While it’s a little more technical than the options on HaveIBeenTrained.com, this is still a pretty friendly tool for most users. There is documentation to walk you through how to install the extension on different browsers if that’s not a process you’re familiar with.
Pros:
Protects your work even when it’s in the hands of someone else. If someone copies your image and puts it on Instagram, you can still add it to the Do Not Train Registry.
Protects both media files and HTML files, which includes most website text, at the time of download.
Can protect all of your work, not just images in LAION-5B.
Cons:
You have to know where to find your work.
Must be comfortable with browser extensions. Ours is open source.
TDMRep
TDMRep is a web protocol created through a W3C Community Group to address rights reservations under the EU’s CDSM Article 4(3) (note: it is not a formal W3C standard). In addition to a TDM reservation, it also allows for the inclusion of a TDM policy, which provides information about TDM licenses that may be obtained from the rights holder.
TDMRep allows for multiple techniques for embedding reservations—as a JSON file on the origin server, in HTTP headers, in HTML content, or as metadata on EPUB files. This last use case seems to be the one that has gained the most traction in the EU.
While TDMRep’s specification recommends choosing one technique to use, Agents (those engaged in text and data mining) are instructed on how to prioritize across the different techniques. However, users face an additional level of complexity in determining the most appropriate technique for their content, and they may end up with gaps in coverage if the technique they choose doesn’t match how their content is hosted.
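As a concrete example, here’s a rough sketch of how an agent might check the origin-server technique, which publishes a JSON file at a well-known location. The path and field names reflect our reading of the TDMRep specification and may differ from the current version.

```python
# Rough sketch of checking TDMRep's origin-server technique: a JSON file at
# /.well-known/tdmrep.json listing path patterns with "tdm-reservation" flags.
# The path and field names follow our reading of the spec and may differ.
import json
from urllib.request import urlopen

domain = "example.com"
try:
    with urlopen(f"https://{domain}/.well-known/tdmrep.json", timeout=10) as resp:
        rules = json.load(resp)
except OSError:
    rules = []  # no origin-server file; an agent would fall back to other techniques

for rule in rules:
    if rule.get("tdm-reservation") == 1:
        print(f"TDM rights reserved for paths matching {rule.get('location')}")
        if "tdm-policy" in rule:
            print(f"Licensing policy: {rule['tdm-policy']}")
```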
The pros and cons depend on the technique being used, and will be somewhat similar to ai.txt (for origin-server files), HTTP headers (for TDMRep HTTP headers), or C2PA (for EPUB metadata). In addition:
Pros:
Multiple ways to embed (HTTP headers, HTML, or metadata on EPUB files) based on your needs.
Inclusion of a policy to indicate work can be licensed under conditions.
Cons:
TDM Agents have to check each possible location (the origin-server file, HTTP headers, HTML, and metadata) to determine which permissions setting supersedes the others.
Complexity for individual users to select the correct method to use for their specific case, and overall a somewhat high level of technical complexity.
Can’t be used to opt out images hosted by others (though it can cover EPUB files hosted by others, or images on sites you control).
What about Opt-Out methods not covered here?
Congratulations if you’re still with us! At this point you may be thinking, what about this other opt-out method I heard of? There are other ways to opt out, and we’re going to address a few of them and why they haven’t been included in the Data Diligence package.
Robots.txt: Anti-crawling notices
Robots.txt is a plain text file that sits at the root level of your site and indicates whether specific crawlers are allowed to crawl the site. Rules can target specific user agents, for example Google-Extended, which allows you to block your content from being used for Google’s Bard and Vertex AI (note that it does not appear to block the website snapshots used for Google’s “Search Generative Experience”). These rules only work if the user agent respects the declaration.
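For reference, a Google-Extended rule looks something like the example below, checked here with Python’s built-in robots.txt parser; the rules shown are illustrative.

```python
# Example robots.txt rules that block Google-Extended (the token Google uses
# for Bard / Vertex AI training) while leaving other crawlers alone, checked
# with Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Google-Extended", "https://example.com/gallery/"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/gallery/"))        # True
```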
We’ve already written about some of the technical constraints that make robots.txt fall short when it comes to AI training. We don’t incorporate it into the Data Diligence package because it targets individual crawlers rather than the use of scraped content for AI training.
An additional major limitation is the need to continually add new rules whenever new crawlers are introduced. Robots.txt does let you block all crawlers, but if you still want to show up in search engines, you have to define a separate rule for each AI crawler instead. There are some resources tracking which crawlers are out there, but this is a lot of effort for rights holders to stay on top of, and it’s unlikely that you’d always be able to preempt newly launched crawlers before they actually start crawling. While this is difficult enough for individuals with a single website, large rights holders may need to update permissions across hundreds of websites, over and over and over again.
Overall, the multiplicity of possible crawlers means we don’t see robots.txt as an ideal long-term solution for expressing rights reservations for TDM and AI training, but we still recommend that you include them on your website for the best coverage. Google gives a primer on how to add robots.txt files to your site.
Other Piecemeal Opt-outs
By this we mean opt-out requests that are individual to specific LLM training groups. As with robots.txt rules, these are too varied and too vulnerable to over-proliferation to be a tenable solution for rights holders, and we agree with the argument that these go against the expectation of the EU’s TDM exceptions.
Microsoft’s documentation about their own individual, voluntary opt-out standard raises a concern over these piecemeal opt-outs from the developer side, saying, “it is essential, though, that ‘opt outs’ not discriminate among AI developers. . . .”
Meta’s response has been even more limited. According to a spokesperson, “depending on where people live, they may be able to exercise their data subject rights and object to certain data being used to train our AI models.” They offer a form that users can submit with proof that personally identifiable information appears in the output of Meta’s generative AI models. We do not see that this addresses rights reservations with regard to AI training. A close look at Meta’s wording suggests that it would not even necessarily cover personally identifiable information used in training data: it applies only to specific cases in which that information appears in the model’s output, only when that information comes from a third-party data source rather than Meta’s own platforms, and only insofar as is “consistent with your local laws.”
Terms of Service
It is still to be determined whether terms of service can meet the standard of machine readability. While they are mentioned in the (non-binding) recitals of the CDSM, it is difficult to promote these as machine-readable given how data scraping occurs. Some do argue that the abilities of LLMs make terms of service machine readable; however, LLMs currently also “hallucinate” and few are looking for probabilistic adherence to their legal terms.
Our recommendation is that terms of service shouldn’t be the only method of rights reservations that you use. If the EU determines that these meet the standard of machine-readability, we’ll make updates accordingly.
I Announced it on X
The technicalities make this unrealistic as a method of rights reservation. We don’t believe a social media post meets the standard of machine readability, and it isn’t attached to each piece of media as the works are being scraped. Even AI developers meeting a standard of best effort would not be able to avoid media that was copied without provenance information, so this isn’t an effective means of expressing rights reservations if your goal is to keep your IP out of AI models.
That said, in the days of Twitter, API access would have allowed a bot to search through your posts for image links and add any links it found to the Do Not Train Registry. Since X restricted its API access, building a bot like that is prohibitively expensive.
Interested in learning more about a method you don’t see here? Let us know.