Have I Been Trained is Back!
The LAION-5B search and opt-out tool returns with advanced safeguards to identify and prevent access to CSAM
Have I Been Trained (HIBT) is back online after an eight-week hiatus. The free search tool, which we introduced in September 2022, allows anyone to see what images are included in the popular LAION-5B dataset and add their images to our Do Not Train Registry. It returns with advanced safeguards to detect and remove CSAM, so we expect slower search times for the near future. For now, the "more like this" image search and duplicate detection features remain unavailable, but caption search and opting out are still supported.
HIBT was introduced to increase visibility into a popular AI training dataset and empower creatives to make conscious choices about how their work is used. When HIBT was first released in 2022, many artists weren’t aware that their work had already been used to train AI text-to-image models. HIBT raised awareness on this issue, notably serving a role in the Andersen class action lawsuit by showing in court that the plaintiffs’ works were in fact a part of the LAION-5B AI training dataset.
HIBT also provided the first opportunity for rights holders to opt out their work from future AI model training. Stability AI, creator of the widely used Stable Diffusion model, has already promised to respect opt-outs in Spawning's Do Not Train Registry when training Stable Diffusion 3, and the EU's text and data mining (TDM) copyright exceptions explicitly call for commercial model trainers to honor rights holders' opt-out requests.
While we’ve gone on to create more efficient methods for rights holders to opt out their works, HIBT’s search still provides a visual-first opt-out method and valuable insight into LAION-5B for anyone who wants to better understand the contents of the widely used dataset.
Why was HIBT taken down?
In late December 2023, a team at the Stanford Internet Observatory’s Cyber Policy Center, led by David Thiel, released a paper describing a method for locating and removing child sexual abuse material (CSAM) found in LAION-5B. In response to this paper, Spawning took down HIBT until we could implement the recommendations it laid out and explore additional steps to remove CSAM from the LAION-5B content that powers HIBT.
HIBT was already blocking “Not Safe for Work” (NSFW) content from appearing in its caption search results, using LAION’s NSFW classifier. The Stanford paper provided a workable method for accurately identifying specific CSAM images so that they could be removed. It also drew public attention to the idea that CSAM could be found in the LAION-5B dataset. To prevent HIBT from being used by anyone trying to access that content, we brought the site down.
How are we addressing CSAM in LAION-5B search results?
We’ve used this time to take additional steps to remove CSAM and prevent it from being accessed through HIBT. We consulted with Thiel, the Canadian Centre for Child Protection (C3P), and PhotoDNA to ensure a robust plan to address this issue. These steps represent an abundance of caution and go beyond those of most large image-hosting websites.
First, we deleted all NSFW-marked content from our version of the dataset, meaning none of the images identified by the Stanford team can be accessed through HIBT. This step removes a potential failure point, making it impossible for a bug or hack to bypass the NSFW filter. Please note, however, that not all NSFW content is accurately marked as such, so the LAION-5B dataset search results should still be viewed with caution.
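To illustrate, the deletion step amounts to dropping every row the classifier did not positively clear. A minimal sketch, assuming the dataset metadata is held in tables with LAION's `NSFW` annotation column (the column name and its `"UNLIKELY"`/`"UNSURE"`/`"NSFW"` values are an assumption about the schema, not Spawning's actual pipeline):

```python
import pandas as pd

def drop_nsfw_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows the classifier marked as unlikely to be NSFW.
    # Anything else ("NSFW", "UNSURE") is removed outright, so no
    # runtime filter needs to work correctly for it to stay hidden.
    return df[df["NSFW"] == "UNLIKELY"].reset_index(drop=True)

# Toy metadata shard with hypothetical URLs and annotations.
demo = pd.DataFrame({
    "url": ["a.jpg", "b.jpg", "c.jpg"],
    "NSFW": ["UNLIKELY", "NSFW", "UNSURE"],
})
print(drop_nsfw_rows(demo)["url"].tolist())  # → ['a.jpg']
```

Deleting at the data layer, rather than filtering at query time, is what removes the failure point: there is nothing left for a bypassed filter to expose.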
Second, we are now running all HIBT search results through C3P’s Arachnid, a CSAM-detection API, before they appear for users. All images flagged as CSAM will be removed from HIBT results and reported to legal authorities. This filtering step slows search results, but we expect search speeds to improve on repeat searches once the CSAM check is completed for those images. While we hate to reduce performance, we’re taking this step to catch any CSAM that LAION’s NSFW classifier might misidentify. The Stanford study reviewed only a fraction of the dataset’s NSFW images, so it’s possible that some material remains unidentified.
Long term, we are working with C3P and PhotoDNA to develop methods to review text-image datasets at scale.
Third, HIBT’s “more like this” image search and duplicate detection features have been removed. We introduced these features to make it easier for creatives to find copies of their work online. Until the full dataset has been reviewed, we are removing these features to make it harder for anyone to intentionally search for CSAM content.
With these steps in place, and a commitment to take immediate action as we learn more about where CSAM occurs and how it can be blocked, we’re excited to be able to offer HIBT again.
Has anything changed about the free service?
HIBT will continue to run as a free service for anyone who wants to learn more about the LAION-5B dataset. Rights holders and researchers can continue to use the tool to search the dataset, but it will take a bit longer to return search results. You can also use the tool to add your own works to our Do Not Train Registry, indicating that you do not want your work to be used in any future AI training.
You will no longer be able to do a “more like this” image or duplicate search with HIBT. Instead, you can opt out your entire domain in one go with our Do Not Train Registry and then use the Spawning browser extension to quickly and preemptively opt out works that are hosted across the web.