Last Thursday, we announced Public Domain 12M (PD12M), an image-text dataset with 12.4 million image-caption pairs. The dataset has already surpassed four thousand downloads and reached #2 on Hugging Face’s trending datasets. We’re thrilled by the response, and we can’t wait to see what the community does with it!
As more people dive in, we’ve been asked about how to participate in the community-driven aspects of PD12M’s governance. Read on to learn more.
What does “dataset governance” mean, and how is it community-driven?
Dataset governance means continuously monitoring and maintaining the dataset. We do this by auditing the contents (reviewing and removing problematic images) while keeping the dataset as a whole stable: each flagged image is replaced with a new image that matches its characteristics, such as size, aesthetic score, and perceptual similarity.
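To make that stabilization step concrete, here is a minimal sketch of what "replacing a flagged image with one that matches its characteristics" could look like. The record fields, tolerances, and candidate pool below are illustrative assumptions, not Spawning's actual pipeline.

```python
# A hedged sketch of the "stable replacement" idea described above.
# Field names, tolerances, and the candidate pool are assumptions only.
from dataclasses import dataclass

@dataclass
class ImageRecord:
    id: str
    width: int
    height: int
    aesthetic_score: float   # e.g., output of an aesthetic predictor
    embedding: list[float]   # perceptual/semantic embedding of the image

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b + 1e-12)  # small epsilon avoids divide-by-zero

def find_replacement(flagged: ImageRecord, pool: list[ImageRecord],
                     max_size_drift: float = 0.1,
                     max_aesthetic_drift: float = 0.25) -> ImageRecord | None:
    """Among candidates whose size and aesthetic score stay within tolerance,
    pick the one most perceptually similar to the flagged image."""
    def size_ok(c: ImageRecord) -> bool:
        return (abs(c.width - flagged.width) / flagged.width <= max_size_drift and
                abs(c.height - flagged.height) / flagged.height <= max_size_drift)

    candidates = [c for c in pool
                  if c.id != flagged.id
                  and size_ok(c)
                  and abs(c.aesthetic_score - flagged.aesthetic_score) <= max_aesthetic_drift]
    if not candidates:
        return None  # widen tolerances or queue for manual curation
    return max(candidates, key=lambda c: cosine_similarity(c.embedding, flagged.embedding))
```

The design intuition: constraining size and aesthetic score keeps the dataset's aggregate statistics stable, while ranking by embedding similarity keeps each replacement perceptually close to the image it stands in for.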
We recognize that, despite our best efforts, some problematic images will slip through the curation process in a dataset of this size. Public domain works in particular raise concerns about bias, both because of the time period in which many of the images were created and because of the overrepresentation of Western institutions in open, digitized cultural collections. Rather than ignoring these concerns, we are building a system that acknowledges they exist and responds to them quickly.
We set up Source.Plus so that datasets can be fully visible and easily searchable. For those familiar with HaveIBeenTrained.com, a tool that brought transparency to the contents of major datasets like LAION-5B: we've incorporated similar search features into Source.Plus, including reverse image search, and have added new ones.
Source.Plus takes this transparency effort even further. It offers a more robust range of search capabilities (semantic and faceted search) and displays a full record of provenance and metadata alongside each image. To support community-driven audits, Source.Plus lets users flag images for any reason; flagged images are automatically removed from view. We review flags internally and can swap out flagged images without disrupting the dataset's statistical characteristics.
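As an illustration of how semantic and faceted search can work together, here is a small sketch: facet filters narrow the candidate set by exact metadata match, then a text-embedding similarity ranks what remains. The index layout, facet names, and choice of encoder are illustrative assumptions; this is not the Source.Plus implementation or API.

```python
# A hedged sketch of combining faceted and semantic search, in the spirit of
# what's described above. The schema and facet names are assumptions only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # CLIP-style text encoder

def search(index, query, facets, k=20):
    """index: list of dicts with 'caption_embedding' (tensor) and 'metadata' (dict).
    Facet filters narrow by exact metadata match; the query ranks what remains."""
    survivors = [r for r in index
                 if all(r["metadata"].get(f) == v for f, v in facets.items())]
    if not survivors:
        return []
    q = model.encode(query, convert_to_tensor=True)
    scored = [(float(util.cos_sim(q, r["caption_embedding"])), r) for r in survivors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]

# Example usage (hypothetical facet values):
# search(index, "sailing ships at dusk", {"license": "CC0", "source": "Rijksmuseum"})
```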
For a deeper dive into the how and the why, check out the paper on arXiv.
How can I participate in auditing the dataset?
Go to Source.Plus, and search within the PD12M collection. Play around. There’s a lot of cool stuff in there! If you notice a problem, flag the image. Use the free-text field to tell us what the problem is—you may find a reason to flag something that we didn’t anticipate.
How can I contribute to PD12M and the Public Diffusion model?
We believe there can be a flourishing AI commons that supports public domain models. To that end, any Source.Plus user can create a public collection and upload their own CC0 work. When an image is uploaded to a public collection (always under a CC0 license), it becomes available to anyone interested in training or fine-tuning within a framework of mutual consent, and it can be used to fill gaps in PD12M created through the auditing process. Sharing your work this way makes it instantly searchable, and Source.Plus adds captions and other metadata enrichments for you and others to use. The same captioning and enrichment apply to private collections, which don't require works to carry a CC0 license.
Our CEO, Jordan Meyer, and CTO, Nick Padgett, are heads down on Spawning's forthcoming Public Diffusion model: a foundation model trained entirely on public domain and CC0 materials. If you'd like your works included in the model, we welcome you to add them to Source.Plus before December 1, 2024.
If you'd like to help us train Public Diffusion or beta test it while it's in progress, please reach out to me at laura@spawning.ai. Please also write if you're interested in contributing to the AI data commons but still have questions.