Responsible Dataset Creation and Governance: Public Domain 12M
A dataset that puts quality, safety, and consent at the forefront
Spawning is excited to release Public Domain 12M (PD12M)—a fully public domain dataset freely available to download through Hugging Face. Search and view every image in the dataset at Source.Plus.
PD12M is a collection of 12.4 million PD/CC0 image-caption pairs carefully curated to train generative text-to-image models. We also release PD3M, a subset containing the 3.3 million images with the highest aesthetic scores in Source.Plus. The images included in the dataset are hosted through AWS Open Data. Archiving the images in this way provides a unique opportunity to precisely modify the dataset over time, resolving data concerns while also maintaining reproducibility.
The datasets are designed to function cleanly with training routines already built around the Conceptual 12M (CC12M) and Conceptual Captions (CC3M) datasets. These popular datasets have been cited nearly 3,000 times. However, a sampling of the original CC3M dataset by Hugging Face found that over 25% of the included images had been opted out of AI training. PD12M and PD3M enable an enormous body of research to be applied while complying with the EU requirements outlined in the Directive on Copyright in the Digital Single Market and the AI Act.
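As a minimal sketch of that drop-in compatibility, the datasets can be streamed with the Hugging Face `datasets` library. The dataset ID and column names used below are assumptions; check the dataset card on Hugging Face for the exact repository name and schema.

```python
# A minimal sketch: stream PD12M image-caption pairs from Hugging Face.
# The dataset ID "Spawning/PD12M" and the "url"/"caption" columns are
# assumptions; verify them against the dataset card.
from datasets import load_dataset

pd12m = load_dataset("Spawning/PD12M", split="train", streaming=True)

# Peek at a few pairs without downloading the full archive.
for row in pd12m.take(3):
    print(row["url"], "->", row["caption"])
```

Streaming avoids pulling the full archive to disk before training code can iterate over samples.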
We used Source.Plus to enrich and curate PD12M from a collection of nearly 40 million PD/CC0 images. These datasets build on Source.Plus’s goal of providing a high-quality alternative to web scraping for training data, one that puts data quality, safety, and consent at the forefront.
The Problem with Web-Scale Datasets
Our work with HaveIBeenTrained.com and supporting rights reservations has highlighted many of the issues that arise with indiscriminate data scraping from the open web, and we’ve seen how those issues start with the datasets themselves:
Lack of consent and compliance due to the inclusion of opted-out and copyrighted materials
Safety concerns due to the inclusion of violent and NSFW content, CSAM, and PII
Poor data quality from low-resolution images, inaccurate or insufficient captions, broken links, and vulnerability to other changes to web data over time
Externalized egress costs charged to web hosts whenever their data is downloaded
Poor experimental control when comparing model training techniques, since the underlying data is subject to change over time
Limited ability to make corrections to the dataset
Today, and over a series of posts, we’ll be exploring how the PD12M datasets and Source.Plus address each of these issues. For the technical nitty-gritty, check out the paper on arXiv.
Copyright
PD12M includes only content with known provenance, labeled with clear, unambiguous rights for use in AI training. To this end, we’ve been more restrictive than other Creative Commons-based datasets by including only works labeled with a public domain mark or a CC0 license.
It’s unclear whether AI image generation can meet the requirements of CC share-alike and attribution licenses. Our goal is to remove ambiguity and act in line with the spirit of these licenses; it’s good data manners.
Quality and safety
We created a range of metadata enrichments that were used to filter the ~40 million images in Source.Plus down to the final 12.4 million images in PD12M.
PD12M was filtered based on the following criteria (a simplified sketch of the automated pass follows the list):
minimum dimensions and file size
safety model scores
custom aesthetic scores
medium filters, to remove historical documents
human review using semantic, keyword, and metadata search to locate and remove potentially problematic materials such as non-artistic photographic nudity and ethnophaulisms
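To make the automated portion of the pipeline concrete, here is a simplified, hypothetical sketch of metadata-based filtering. The thresholds, field names, and score conventions are illustrative assumptions, not the production values.

```python
# A simplified, hypothetical sketch of the automated metadata filtering that
# narrows ~40M candidates toward PD12M. All thresholds, field names, and
# score conventions are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ImageRecord:
    url: str
    caption: str
    width: int
    height: int
    file_size: int        # bytes
    safety_score: float   # assumed convention: higher = safer
    aesthetic_score: float
    medium: str           # e.g. "painting", "photograph", "document"

MIN_DIM = 256             # hypothetical minimum width/height in pixels
MIN_BYTES = 10_000        # hypothetical minimum file size
EXCLUDED_MEDIA = {"document", "newspaper", "manuscript"}  # medium filter

def passes_filters(rec: ImageRecord) -> bool:
    return (
        min(rec.width, rec.height) >= MIN_DIM
        and rec.file_size >= MIN_BYTES
        and rec.safety_score >= 0.98          # safety model threshold
        and rec.aesthetic_score >= 5.0        # custom aesthetic threshold
        and rec.medium not in EXCLUDED_MEDIA  # drop historical documents
    )
```

The human-review step (semantic, keyword, and metadata search) happens after an automated pass like this and is not modeled here.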
We also checked every image against the ~2 billion rights reservations in the Do Not Train registry and found zero matches.
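As a purely hypothetical illustration of what such a check looks like, the sketch below posts a batch of URLs to a placeholder opt-out endpoint. The endpoint and response shape are invented for illustration, not Spawning’s actual API; consult Spawning’s documentation for the real interface.

```python
# A purely hypothetical sketch of batch-checking URLs against an opt-out
# registry. The endpoint and response shape are placeholders, not Spawning's
# actual API; see Spawning's documentation for the real interface.
import json
import urllib.request

REGISTRY_ENDPOINT = "https://example.invalid/opt-out/check"  # placeholder

def urls_opted_out(urls: list[str]) -> list[str]:
    """Return the subset of `urls` that has a matching rights reservation."""
    payload = json.dumps({"urls": urls}).encode()
    req = urllib.request.Request(
        REGISTRY_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumed response shape: {"<url>": true/false, ...}
        result = json.load(resp)
    return [url for url, opted_out in result.items() if opted_out]
```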
Fair server costs
We have partnered with AWS’s Open Data Sponsorship Program to host the entire ~30 TB of image-and-caption data on dedicated cloud storage. This is an important measure that protects cultural heritage institutions from the hosting costs and server strain incurred when the dataset is downloaded.
Experimental control & corrections
Independent cloud storage also protects the dataset from degradation caused by broken links or changing online content. Significantly, it also means that we are able to fully control changes to the dataset and maintain versioning records.
This novel approach to dataset governance allows us to correct any issues that may be found in the datasets. We recognize that, when working at this scale, there will be errors despite our mitigation steps, which is why it is deeply important for dataset creators to be able to rectify them. We’ve designed PD12M so that if a Source.Plus user flags an included image as problematic, we can swap that image out of the datasets for another with similar characteristics.
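As a hypothetical sketch of this swap-and-version idea, the snippet below replaces a flagged record and appends an auditable entry to a version log. The record and log structures are illustrative, not the production format.

```python
# Hypothetical sketch: swap a flagged image for a similar replacement and
# record the change so earlier dataset versions stay reconstructible.
from datetime import datetime, timezone

def swap_image(records: dict, version_log: list,
               flagged_id: str, replacement: dict) -> None:
    """Replace a flagged record and append an auditable version entry."""
    records.pop(flagged_id)                    # drop the flagged image
    records[replacement["id"]] = replacement   # add a similar substitute
    version_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "removed": flagged_id,
        "added": replacement["id"],
        "reason": "flagged via Source.Plus review",
    })
```

Persisting the version log alongside each release is what keeps earlier experiments reproducible even after corrections.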
Significantly, the public-domain status of the dataset is integral to these larger goals. Datasets that include copyrighted images will continue to rely on web-scraping because hosting the images would violate copyright.
Open source data
PD12M and PD3M are fully open source and available under the Community Data License Agreement (CDLA-Permissive-2.0). We invite you to explore the images included in PD12M on Source.Plus, along with additional dataset resources and documentation.