The Making of PD12M: Image Acquisition
Two weeks ago, we introduced Public Domain 12M (PD12M)—a highly aesthetic 12.4 million image-text dataset that includes only images labeled with a Public Domain Mark or CC0 license indicating “no rights reserved.” The dataset is available through Hugging Face, and all images are hosted through AWS Open Data and are viewable alongside provenance and metadata on Source.Plus.
PD12M minimizes concerns about copyright, safety, and consent, while maintaining a high bar for image quality. Crucially, it provides an alternative to the indiscriminate open-web scraping of images for training data. Its method of acquiring and distributing images is a huge part of why we feel PD12M is needed. Today, we’re going to look at how the ~38 million images on Source.Plus were collected, how PD12M is hosted, and what makes this collection process different from that used for other datasets.
The Dataset Problem
A typical process for compiling a large-scale dataset might involve using Common Crawl to collect all the image URLs discovered during the latest complete internet crawl. These URLs are then matched with captions (often the alt text published alongside the image, but sometimes image descriptions or synthetic captions). Datasets may also go through a deduplication process, or include an NSFW/safety score and other metrics alongside the image-caption pairs that can be used to filter the data before training.
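As a rough illustration, that kind of metadata-based filtering can be sketched in a few lines. The field names (nsfw_score, width, height, caption) and thresholds below are hypothetical, not drawn from any particular dataset's schema:

```python
def filter_records(records, max_nsfw=0.5, min_side=256):
    """Keep image-caption pairs that pass basic safety and quality thresholds."""
    kept = []
    for rec in records:
        if rec["nsfw_score"] > max_nsfw:
            continue  # drop records the safety scorer flagged
        if min(rec["width"], rec["height"]) < min_side:
            continue  # drop low-resolution images
        if not rec["caption"].strip():
            continue  # drop pairs with no usable alt text
        kept.append(rec)
    return kept

sample = [
    {"url": "a.jpg", "caption": "a painting", "nsfw_score": 0.1, "width": 1024, "height": 768},
    {"url": "b.jpg", "caption": "", "nsfw_score": 0.0, "width": 800, "height": 600},
    {"url": "c.jpg", "caption": "a photo", "nsfw_score": 0.9, "width": 640, "height": 480},
]
print([r["url"] for r in filter_records(sample)])  # ['a.jpg']
```

The point of keeping these scores as columns rather than pre-filtering is that each downstream user can choose their own thresholds.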
These datasets can be incredibly large, but as the scale increases, so does the number of low-quality images (small, blurry, or inaccurately captioned), along with safety concerns (NSFW material and CSAM that the datasets' out-of-the-box NSFW filters don't adequately catch), privacy concerns, and non-consenting use.
While some model trainers and jurisdictions respect opt-outs and filter out those images, the opted-out content itself remains indexed in the original datasets. A sampling of the Conceptual 12M dataset conducted by Hugging Face found that 25% of the included images had been opted out by their rights holders. Complying with the EU's requirements for commercial use, as outlined in the 2019 Directive on Copyright in the Digital Single Market and the AI Act, therefore shrinks the usable size of the dataset considerably. It shrinks still further when accounting for the broken links that degrade datasets over time.
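To see why a 25% opt-out rate matters, here is a minimal sketch of filtering a dataset's URL list against an opt-out set before training. The opt-out set is a plain Python set here; in practice it would come from a rights-holder opt-out registry:

```python
def remove_opted_out(urls, opted_out):
    """Drop any URL that appears in the opt-out set."""
    return [u for u in urls if u not in opted_out]

urls = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
opted_out = {"b.jpg"}  # hypothetical registry entries
remaining = remove_opted_out(urls, opted_out)
print(len(remaining) / len(urls))  # 0.75: a 25% opt-out rate shrinks the dataset accordingly
```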
When a developer is ready to use a dataset for training, they download the dataset and then use its list of image URLs to download each of the images. Given the size of these datasets, this is slow, and servers may throttle download speeds or access, making the process even more time-consuming. It can also be a costly one, for the image hosts.
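The throttling problem is usually handled with retries and backoff. The sketch below shows that pattern with an injected fetch function standing in for the actual HTTP request; the function names and backoff values are illustrative, not from any specific downloader:

```python
import time

def download_with_retry(url, fetch, retries=3, backoff=0.01):
    """Try fetch(url) up to `retries` times, backing off between attempts."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# Simulate a host that throttles the first two requests.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("429 Too Many Requests")
    return b"image-bytes"

result = download_with_retry("https://example.com/img.jpg", flaky_fetch)
print(result)  # b'image-bytes' after two throttled attempts
```

Multiply that backoff across millions of URLs and the cost of re-downloading a dataset from scratch becomes clear.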
Source.Plus Data Acquisition
We followed a different approach to gather the images and metadata included in Source.Plus.
Data Sources
Source.Plus pulls together images from some of the top museums and cultural heritage institutions in the world. Images come from Europeana, Wikimedia, the Smithsonian Institution, and more (find more details here). These institutions have labeled their images with license types and have published collections of public domain and CC0 images that are open for all to use.
Starting primarily from OpenGLAM institutions ultimately made the final dataset significantly higher quality. PD12M contains a range of art styles and artists' sketches captured with professional, archival photography. Moreover, these are all already-curated collections, with vetted information about each work's creator, creation date, and so on. And while these sources are open, they aren't exhaustively indexed by standard web crawling, meaning that many of these high-quality collections rarely appear in a standard data scrape.
Wikimedia Commons provided a range of modern photography and subjects to help balance the dataset's contents against the materials that have aged into the public domain. Because this source allows user uploads, its license labels are more prone to inaccuracies. To mitigate this, we delayed ingestion of Wikimedia images by 14 days so that the Wikimedia community would have time to flag contested images, and we did not include any images under review in Source.Plus. Additionally, we filtered images marked as Public Domain or CC0 against a number of Wikimedia license templates to exclude CC-BY and CC-SA licenses.
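A template-based license filter like the one described can be sketched as follows. The template names and prefixes below are illustrative examples only, not the actual set Source.Plus matches against:

```python
# Hypothetical template lists; the real Wikimedia template set is much larger.
ALLOWED_TEMPLATES = {"PD-old", "PD-US", "Cc-zero"}
DISALLOWED_PREFIXES = ("Cc-by-", "Cc-sa-")

def is_acceptable(templates):
    """Accept a page only if its license templates indicate PD or CC0,
    and none indicate an attribution or share-alike license."""
    if any(t.startswith(DISALLOWED_PREFIXES) for t in templates):
        return False  # CC-BY / CC-SA impose attribution or share-alike terms
    return any(t in ALLOWED_TEMPLATES for t in templates)

print(is_acceptable(["PD-old"]))                # True
print(is_acceptable(["Cc-zero", "Cc-by-4.0"]))  # False: mixed licensing is excluded
```

Note the asymmetry: a single disallowed template rejects the page, even when a CC0 template is also present.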
Provenance
Source.Plus maintains a full record of provenance for each item as well as numerous pieces of metadata. As we embarked on the image acquisition process, we knew we wanted to collect all available information about the artist, uploader, hosting institution, licensing information, and other metadata. But preserving this information introduced its own challenges. The Spawning team had to analyze the idiosyncratic data structures for each API or dataset dump to determine how to extract that information.
License Types
We are restrictive about which license types are included in Source.Plus’s collection—Public Domain Marks and CC0 licenses only. CC0 is a special license that allows rights holders “to opt out of copyright and database protection” and essentially dedicate their work to the public domain. We have a flagging system in place to remove any images that we, or users, surface that were labeled by the uploader with incorrect licensing information.
We made this decision because other Creative Commons licenses still have restrictions. For example, CC-SA licenses, as found in the Yahoo Flickr Creative Commons 100M Dataset, allow you to “remix” an image, with the provisions that you provide attribution, “indicate that changes were made,” and share the contributions that “remix, transform, or build upon the material” under the CC-SA license. We find it reasonable to interpret a model as transforming or building upon its training data, so we opted to remove images under any license other than a Public Domain Mark or CC0.
Image Cloning
We downloaded each image only once to clone it and migrate it to our own cloud host. We feel this is a necessary measure to avoid disruption and egress costs for the original hosts. We used these copies in our work curating PD12M.
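The "download each image only once" approach can be sketched as a content-addressed local cache keyed by a hash of the URL, so repeated curation passes reuse the clone instead of re-hitting the original host. All names and paths here are illustrative, not Spawning's actual pipeline:

```python
import hashlib
import pathlib
import tempfile

def cache_path(root, url):
    """Derive a stable, collision-resistant local path from the URL."""
    digest = hashlib.sha256(url.encode()).hexdigest()
    return pathlib.Path(root) / digest[:2] / digest

def clone_once(root, url, fetch):
    """Fetch a URL at most once; later calls read the local clone."""
    path = cache_path(root, url)
    if path.exists():
        return path.read_bytes()  # reuse the clone; no egress for the host
    path.parent.mkdir(parents=True, exist_ok=True)
    data = fetch(url)
    path.write_bytes(data)
    return data

calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return b"image-bytes"

with tempfile.TemporaryDirectory() as root:
    clone_once(root, "https://example.org/a.jpg", fake_fetch)
    clone_once(root, "https://example.org/a.jpg", fake_fetch)
print(calls["n"])  # 1: the second call read the cached copy
```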
The images linked in PD12M are generously hosted by the AWS Open Data program, making them available to anyone who wants to use the dataset without burdening the original institutions. Although PD12M includes fewer images than some other datasets, it points to ~30TB of image data because each individual image is large and high quality, so the implications of directing that level of download traffic at cultural institutions weighed heavily on us. Hosting the images this way, in conjunction with Source.Plus, has also allowed us to implement community-driven governance of the dataset.
If you’d like to go deeper, take a look at the paper on arXiv or review our Datasheet. Next week, we’ll be talking about our curation process and how we decided which of the ~38 million images on Source.Plus would make the cut for the PD12M dataset.