The Making of PD12M: Dataset Curation
How we went from 38M PD/CC0 images down to a high-quality 12.4M-image dataset
This is the fourth post in a series about the Public Domain 12M dataset (PD12M). If you missed the other posts, here’s a quick recap: PD12M comprises 12.4 million image-caption pairs, making it the largest dataset of public domain- and CC0-marked images for training text-to-image models. PD12M was curated from a collection of ~38 million images available through Source.Plus, which also allows for ongoing community governance and stabilization of the dataset. You can read more about how all of those images were acquired in our last post, or get into all the technical details of the dataset in the paper. PD12M is open and free to download on Hugging Face, and the images themselves are available through AWS Open Data.
In today’s post, we’ll be reviewing the mechanisms and decisions behind our curation process to show how we determined which images would make the cut for PD12M and its 3M subset.
Source.Plus as Curation Platform
Source.Plus allows users to curate their own image-caption datasets, and as we constructed PD12M, we built and tested new features to improve curation and metadata enrichment.
We enabled search to locate and review items visually, including reverse image search, metadata search, and semantic search (e.g., I want to find images of “a red sports car,” even if the images aren’t titled as such). We layered these search features over a robust faceted search engine and added data enrichments for every image on Source.Plus. These data enrichments enable Source.Plus users to search and filter by subject, image size, aesthetic score, and a lot more to construct their own collections, so you can follow a method similar to the one we used for PD12M.
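As a rough illustration of how the semantic side works in general (this is not the Source.Plus implementation, and the model choice and file paths are stand-ins), the idea is to embed the text query and the images in a shared space and rank by cosine similarity:

```python
# A minimal sketch of CLIP-style semantic search over a local image folder.
# Illustrative only: model, paths, and top-k count are assumptions.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # CLIP model with text and image encoders

image_paths = sorted(Path("images").glob("*.jpg"))            # hypothetical local folder
image_embeddings = model.encode([Image.open(p) for p in image_paths])

query_embedding = model.encode(["a red sports car"])          # the query from the example above
scores = util.cos_sim(query_embedding, image_embeddings)[0]   # cosine similarity per image

# Print the five closest matches, best first.
for score, path in sorted(zip(scores.tolist(), image_paths), reverse=True)[:5]:
    print(f"{score:.3f}  {path.name}")
```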
Curation Process
A lot of heavy lifting for curation was integrated into the initial acquisition of images that are available on Source.Plus. This process involved filtering by license type, pulling in metadata and provenance information from hosting institutions, and incorporating additional checks for sites that allow user uploads (such as Wikimedia Commons). With that baseline in place, we set about finding the most suitable, high-quality images to include in PD12M.
Here’s how we applied the tools in Source.Plus to filter down from ~38M images to 12.4M:
Medium tags
We used an internal model to tag the medium of every image on Source.Plus, making it easy to restrict search results to oil paintings, sketches, watercolors, photographs, and so on. Since we want PD12M to support models that represent the widest possible range of styles, we included almost all medium types in the dataset.
The notable exception was document scans. Some library sources have vast digital collections of scanned books and historical documents, which aren’t a good fit for a text-to-image training set. As a result, we filtered out 8.7 million document scans.
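In practice this step amounts to a simple metadata filter once every image carries a medium tag; here’s a rough sketch, where the file name, column, and label values are hypothetical:

```python
# Hedged sketch of the medium-tag filter over a hypothetical per-image metadata table.
import pandas as pd

metadata = pd.read_parquet("pd12m_candidates.parquet")   # hypothetical metadata file

kept = metadata[metadata["medium"] != "document scan"]    # drop document scans, keep everything else
print(f"Removed {len(metadata) - len(kept):,} document scans")
```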
Resolution
The entire Source.Plus collection can be filtered based on minimum resolution requirements. We set the minimum threshold for PD12M at 256x256 pixels, meaning the smallest side of any image in the dataset is at least 256 pixels.
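In code, the check is simply that the shorter side clears the threshold; here’s a small Pillow sketch (the helper function is purely illustrative):

```python
from PIL import Image

MIN_SIDE = 256  # PD12M's minimum: the shorter side must be at least 256 pixels


def meets_min_resolution(path: str, min_side: int = MIN_SIDE) -> bool:
    """Return True if the image's shorter side is at least min_side pixels."""
    with Image.open(path) as im:
        width, height = im.size
        return min(width, height) >= min_side
```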
NSFW scores
Images on Source.Plus can be filtered by NSFW score. For PD12M, we excluded works with a score greater than 0.5. We also manually reviewed the dataset using semantic search tools to remove instances of nonartistic photographic nudity. Fair warning: the dataset does include a wide range of paintings, sketches, and photographs of artistic nudes, so you should still exercise caution if you’re actually at work.
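Mechanically, the score filter is just a threshold over precomputed per-image scores; a small sketch (the field name is an assumption):

```python
NSFW_THRESHOLD = 0.5  # PD12M excludes works scoring above 0.5


def passes_nsfw_filter(record: dict, threshold: float = NSFW_THRESHOLD) -> bool:
    """Keep a record only if its precomputed NSFW score is at or below the threshold."""
    return record["nsfw_score"] <= threshold
```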
Ethnophaulisms
We did a manual check against the complete list of ethnophaulisms on Wikipedia. For each term, we searched the images’ metadata and ran a semantic search, then manually flagged and removed derogatory images and metadata.
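The metadata half of that check can be approximated as a case-insensitive term match over each image’s text fields; here’s a sketch with assumed field names (the semantic half works like the search example earlier):

```python
def metadata_matches_term(record: dict, terms: list[str]) -> bool:
    """Flag a record if any term appears in its title or description (case-insensitive)."""
    text = f"{record.get('title', '')} {record.get('description', '')}".lower()
    return any(term.lower() in text for term in terms)

# Records flagged this way were then reviewed by hand before removal.
```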
Deduplication
We removed images with a high degree of similarity to other images in the collection. When we found duplicates, we kept the copy from GLAM sources (galleries, libraries, archives, and museums) over copies from sources that allow user uploads, such as Wikimedia Commons, and preferred the higher-quality image.
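We haven’t detailed the exact similarity pipeline here, but perceptual hashing is one common way to surface near-duplicate candidates; the sketch below is illustrative only, with the keep/drop priority rules above deciding which copy of each pair survives:

```python
# Illustrative only: perceptual hashing to surface near-duplicate candidates.
from pathlib import Path

import imagehash
from PIL import Image

seen: dict[imagehash.ImageHash, Path] = {}
duplicate_pairs = []

for path in sorted(Path("images").glob("*.jpg")):   # hypothetical local folder
    h = imagehash.phash(Image.open(path))
    if h in seen:
        duplicate_pairs.append((path, seen[h]))      # candidates for the keep/drop rules
    else:
        seen[h] = path

# Near-duplicates that differ slightly can instead be compared by Hamming
# distance between hashes (h1 - h2 <= some small threshold).
```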
Aesthetic scores
We used a custom model trained on an internal dataset of human ratings to give each image in Source.Plus an aesthetic score. You can use the aesthetic score to sort any of the images on Source.Plus, including your own images in a private collection. These scores represent the model’s understanding of how a human would rate an image based on composition—it’s not an indication of whether a piece of art is good or bad.
That said, we found this enrichment to be a valuable filtering tool for eliminating lower-quality images. It also tended to “downgrade” museum images that include rulers, mats, and labels, which we wanted to minimize in the dataset. Because the model reflects modern sensibilities around aesthetics, it was a useful lens to apply to the public domain works.
Here’s a snapshot of images representing different aesthetic scores:
As our final curation step, we excluded images in the bottom 50% of aesthetic scores. This cutoff allowed us to match the original size of CC12M (12.4M). To match the original size of CC3M (3.3M), we excluded the bottom 90% of images by aesthetic score.
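The cutoffs themselves are just percentiles over the precomputed aesthetic scores; here’s a sketch of the math, with the record fields assumed:

```python
import numpy as np


def keep_top_fraction(records: list[dict], fraction: float) -> list[dict]:
    """Keep the records whose aesthetic score falls in the top `fraction` of the collection."""
    scores = np.array([r["aesthetic_score"] for r in records])
    threshold = np.percentile(scores, 100 * (1 - fraction))
    return [r for r, s in zip(records, scores) if s >= threshold]

# pd12m_records = keep_top_fraction(candidates, fraction=0.50)  # top 50% -> ~12.4M images
# pd3m_records  = keep_top_fraction(candidates, fraction=0.10)  # top 10% -> ~3.3M images
```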
Subject tags
We trained an in-house model to create subject-matter tags (e.g., portrait, dogs). While we didn’t filter based on these tags for PD12M, they are incredibly useful for curating smaller datasets for fine-tuning. We anticipate releasing the tags in a future version of the dataset that accompanies our Public Diffusion model.
What’s next for PD12M?
PD12M is the basis for our upcoming Public Diffusion model, which is in development now. The preliminary results have been exciting, with large-scale images in a variety of aspect ratios. Stay tuned on X with @spawning_ and @JordanCMeyer for some sneak peeks from our training run, and reach out to me at laura@spawning.ai if you’re interested in joining the private beta!