Opt-outs that Work in the World of AI Data Scraping
A dive into the technical and practical concerns that must guide any opt-out solution
We get a lot of questions about how our opt-out infrastructure works and why we’ve set it up the way we have. Today, we’ll be addressing some of those questions. To do so, we’ll look at the practice of web scraping for AI training and get into the technical and practical concerns that have guided us to register and read opt-outs at the level of individual media files, at the time of training and across multiple machine-readable means.
Anatomy of a Web Page
When you first load a webpage, the web browser reads the site’s HTML. That usually includes the text on the page, but not the images and other media. Instead, the HTML contains links to those media files. The HTML may also include any associated alt text — a textual description of the media.
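For illustration, an image embedded in a page’s HTML looks roughly like this (the URL and description here are made up):

```html
<!-- The HTML stores only a link to the image, plus optional alt text -->
<img src="https://cdn.example.com/images/cactus.jpg"
     alt="A saguaro cactus at sunset" />
```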
To load the media files (images, video, music, PDFs, etc.), the web browser fetches them by following those links and pulling the files from the servers where they are stored. If you were on the internet in the 90s, you’ll remember when this process was slower and images rendered line by line after the rest of the page.
Scraping vs. Crawling
Web crawling and media scraping as done for AI training data are two separate processes that act at different points of a webpage’s anatomy. Crawling is the process used by search engines, such as Google and Bing, and the Internet Archive to “map” the internet. It works by following links from web page to web page and reading each page’s HTML. The crawler stores the HTML files, so it captures any media links associated with the page at the time of its crawl, but it doesn’t “see” the media itself.
Website owners have traditionally moderated crawler behavior by defining access based on user agents, most commonly using a robots.txt file. Robots.txt sits at the root of a site’s domain and leaves a message for web crawlers saying, in effect, yes, you may crawl my pages, or no, you may not.
Traditionally, web crawling has been seen as beneficial because most website owners want to appear in search results. This position has become complicated because crawling alone captures enough data to train LLMs, such as GPT-4, which depend on text. Google, OpenAI and others do announce the crawlers they use for LLM training under distinct user agents, so there is an opportunity to deny their access.
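Concretely, a site owner who wants ordinary search indexing but not AI training crawls could publish a robots.txt along these lines. GPTBot is OpenAI’s declared crawler user agent and Google-Extended is Google’s token for AI training; the file lives at the domain root (e.g., example.com/robots.txt):

```
# Allow general-purpose crawlers to index the site
User-agent: *
Allow: /

# Deny crawlers that collect AI training data
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```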
Web scraping, however, accesses media files, and it does so in a different way. We use “scraping” to refer to the process that model trainers use to download media files from the open web. The web scraper collects each image by going directly to the servers where those images are stored, bypassing the hosting website’s HTML altogether.
Training Datasets
The scraping process starts with a dataset such as LAION-5B. The LAION-5B dataset is essentially a table with billions of rows. Each row includes two major fields: an image URL and a caption, a written description of the image. Additional information is included to aid in training, such as a measure of text-image similarity, the probability that the image has a watermark and the probability that the image is “unsafe.”
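To make that structure concrete, here’s a minimal sketch of inspecting one of the dataset’s metadata files with pandas. The file name is a placeholder, and the column names follow LAION’s published metadata (they can vary between releases):

```python
import pandas as pd

# LAION-5B metadata ships as Parquet files of URL/caption rows;
# the file name below is a placeholder.
df = pd.read_parquet("laion5b-metadata-part-00000.parquet")

# Each row pairs an image URL with its caption, plus training aids
# such as text-image similarity and watermark/unsafe probabilities.
print(df[["URL", "TEXT", "similarity", "pwatermark", "punsafe"]].head())
```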
The initial compilation of a dataset often involves parsing enormous archives of HTML files, such as Common Crawl’s, to locate and extract all the media file links. It also requires removing duplicate and low-quality data, generating caption text, formatting the data consistently, etc.
It would be an enormous task for each model trainer to recreate a 5.8-billion-item dataset from scratch, which leads many model trainers to rely on pre-created datasets. The choices made during dataset creation are, therefore, especially significant because a single major dataset can impact numerous models. The use of pre-created datasets also means that, for most models, the work begins not with dataset creation but with downloading training data from the URLs listed in an existing dataset.
To download the training data, the web scraper goes down the dataset’s list of URLs and methodically follows each one directly to the server where the media is stored. It then downloads the media.
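A minimal sketch of that download step might look like the following. The rows here are placeholders standing in for a dataset’s billions of entries, and real scrapers run these requests massively in parallel:

```python
import requests

# Hypothetical (url, caption) pairs taken from a dataset's rows.
rows = [
    ("https://cdn.example.com/images/cactus.jpg", "a saguaro cactus at sunset"),
]

for i, (url, caption) in enumerate(rows):
    try:
        # Go straight to the host server; the page that originally
        # embedded this image is never consulted.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue  # Dead or unreachable links are simply skipped.
    with open(f"image_{i:09d}.jpg", "wb") as f:
        f.write(response.content)
```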
There are a few interesting implications here:
The dataset doesn’t actually include the media. You can’t just look at a dataset and see what images are included without resolving the links it contains. That’s why a tool like Have I Been Trained? is needed to explore the dataset contents visually.
Provenance is lost. There’s no column in the LAION-5B dataset to record the web page where the image link was originally found. Once everything is in the dataset, information about its origin, ownership and licensing is lost. The AI Act and its transparency requirements, however, suggest this practice may need to change.
The data can change. A dataset reflects a moment in time on the internet, but the web isn’t static. A URL that once linked to an image of a cactus could, at a later time, return an image of a platypus. This is, essentially, the concept on which Kudurru’s misdirection works: when scraping behavior is detected, the server can decline to return an image or start returning images of . . . a platypus. (A sketch of detecting this kind of drift follows this list.)
Scraping acts on a piece of media on a specific server. There’s currently no effective way to register a universal opt-out for a piece of media, to say “never use any image that looks exactly like this one.” Instead, permissions have to be registered at the level of a media URL specific to a host server. That means that works hosted at multiple locations across the web need to have permissions registered at each of those locations.
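On the “data can change” point, a scraper has no guarantee that a URL still returns what the dataset’s caption describes. One way to detect that kind of drift, assuming a content hash had been recorded when the dataset was compiled (the recorded_sha256 value here is hypothetical), is to re-hash each download and compare:

```python
import hashlib
import requests

def content_changed(url: str, recorded_sha256: str) -> bool:
    """Return True if the media now served at `url` no longer matches
    the hash recorded when the dataset was compiled."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    current = hashlib.sha256(response.content).hexdigest()
    return current != recorded_sha256
```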
The Role of User Agents
This is a question that we get a lot, so it’s getting its own section: why can’t website owners just use robots.txt to set their permissions for AI training scraping? Before a web crawler reads a web page’s HTML, it checks the site’s robots.txt file, which can say yes or no to crawling behavior. It’s an existing standard that most sites have already incorporated.
A significant issue for many rights holders is that their works are frequently posted on websites that they do not own and whose permissions they cannot control.
Even for those who do own their own domain, there are gaps in coverage with robots.txt. Since a web page’s HTML is typically only accessed when a dataset is first compiled — not at the time that the web contents are accessed to be used in AI training — robots.txt would usually not be read by the web scraper at the time of training.
Importantly, consent can be withdrawn at any time. Even if the original dataset compilation took robots.txt into consideration, data owners need to be able to update their preferences and have that reflected in any future training that occurs. And, just as significantly, model trainers who want to respect data owners’ preferences need up-to-date consent information. While it is possible to revisit robots.txt, doing so for each of a billion media links would add significant time and computing costs.
There’s also the issue created when a work from one website is linked from another website with a different set of permissions. The original website owner’s robots.txt permissions will not be read if the link is scraped from the new site. Ultimately, for opt-outs to work, they need to operate at the level of the media URL, since that is where the data collection happens for each new round of model training.
Opting Out Media Across the Web
All this means that it’s relatively complicated to get permission preferences in front of the web scraper at the time of each individual scrape, prior to a new bout of training.
It’s also difficult for AI developers to become familiar with and integrate all the different possible opt-out options into their data workflow. Methodological proliferation on the rights holder side makes it more challenging for developers who are making good faith efforts to build models that respect data use reservations. Finding simpler solutions is necessary here to encourage participation (or support compliance, depending on your jurisdiction).
Otherwise, the convenient alternative for the largest model training groups may be to simply pick their own standard that works for them and ask rights holders to individually register rights reservations with each specific training group through each of their different systems. This is consistent with what Google and Meta have already begun to do. However, just as developers can’t individually accommodate opt-outs without some means of consolidation or standardization, rights holders cannot be expected to monitor and respond to a multitude of different opt-out methods and forms across platforms. Such an unreasonable demand will make it essentially impossible for data owners to control the use of their data by all potential data users. Opt-outs by platform also don’t address how smaller organizations and individual researchers will navigate data permission concerns.
Spawning views this as an ecosystem-level problem: a usable system has to address the practical concerns on both sides, as well as the technical process of data scraping.
To that end, our Do Not Train Registry acts as a repository of data permissions that is delivered in large batches through the Spawning API at the time of scraping.
Rights holders can add any media with an individual URL to the registry through a number of channels (domain-level opt-outs, the Have I Been Trained? search and the Spawning browser extension among them).
Simultaneously, the Data Diligence package makes the process of respecting opt-outs easy for model trainers by acting within the training workflow, at the time of media download and without large efficiency costs. And because model trainers need to be able to identify all rights reservations, not just those registered with Spawning, the Data Diligence package recognizes multiple machine-readable opt-out methods.
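As a rough sketch of how this fits into a download pipeline, assuming the is_allowed helper described in the Data Diligence README (the URLs are placeholders, and the package’s documentation has the authoritative interface), a model trainer can filter a batch of dataset URLs before fetching anything:

```python
import datadiligence as dd

# URLs drawn from a training dataset (placeholders here).
urls = [
    "https://cdn.example.com/images/cactus.jpg",
    "https://cdn.example.com/images/platypus.jpg",
]

# Query opt-out status in bulk before downloading anything; one
# boolean comes back per URL, reflecting recognized opt-out signals.
allowed = dd.is_allowed(urls=urls)

to_download = [url for url, ok in zip(urls, allowed) if ok]
```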