Announcing Source.Plus: A Curation Tool for Non-infringing AI Training Data
Source.Plus — the next stage in Spawning’s journey toward safe, consenting, high-quality AI training data.
We’re excited to extend early access to Spawning’s next project, Source.Plus — a platform to curate, enrich, and download non-infringing media collections in bulk for AI training.
We talk to artists and developers every day who are reluctant to engage with generative AI due to its use of copyrighted works. As a result, the most conscientious developers and the most affected communities are often on the sidelines of this rapidly developing field, even though these are the very groups that need to be steering its evolution and that should be able to benefit from participating in AI. Source.Plus tackles the issue of copyright head-on, marking a significant step toward fully consenting model training.
The consent layer for AI
Since its founding, Spawning has been focused on addressing two core problems facing rights holders in the wake of generative AI: the lack of control over intellectual property and limited avenues for remuneration. We began with HaveIBeenTrained.com and the Do Not Train Tool Suite so that rights holders can assert control over their works, clearly indicating in machine-readable formats that they do not wish their works to be used for AI training.
Source.Plus builds on this mission by providing a consenting alternative to indiscriminate web scraping. Training data found on Source.Plus is compliant with EU regulations on TDM rights reservations, free from copyright infringement, and sourced with known provenance.
Source.Plus collections are seeded with nearly 40 million public domain and CC0 images integrated from libraries and museums worldwide. The image repository is of exceptionally high quality, with nearly three times as many images exceeding 1024x1024 pixels as the LAION-400M dataset, as well as a diverse range of artistic styles and photographic subjects. This current set of collections is more than robust enough to train a modern, state-of-the-art model.
Our upcoming roadmap includes an extended suite of data enrichment tools for the Source.Plus platform. We expect these to include quality and safety scoring, automatic captioning, and features such as color adjustments and intelligent cropping. As we move through the early access beta, we will be looking toward the community to learn more about which features would improve your experience, so don’t hesitate to contact us at info@spawning.ai if there’s something you’d like to see on the platform.
Remuneration for rights holders
In the coming months, we also plan to work closely with artists and rights holders to create premium collections of work that can be licensed for AI training from the rights holders themselves. The addition of premium collections is a step toward ensuring that creatives can effectively monetize their work in the AI economy. We believe that just as rights holders should be able to determine if and when their work is used for AI training, they should also be able to profit directly from its use.
We are working with creatives, rights holders, and professional groups to establish a new licensing model for generative AI. Artists will have the autonomy to set prices for use of their media collections, with licensing terms tailored to their preferences. If you’re interested in working with us, reach out at info@spawning.ai.
From the commons, to the commons
As with rights holders, we also want to ensure that value generated by the commons gets returned to the institutions that made our initial set of collections available. Based on the response to the early access beta, we will be outlining a plan to donate to cultural heritage institutions when the works they publish are downloaded.
Unlike typical AI training datasets, which are presented as collections of links for web scraping, Spawning hosts all of the images available through Source.Plus. This not only allows for a seamless bulk downloading experience but also protects the cultural institutions that originally published these works. Web scraping can overwhelm an institution's servers, and each download incurs egress fees for whoever hosts the file. We host the images ourselves rather than externalize those costs onto cultural institutions.
With so much content coming from the public domain, the images on Source.Plus raise a special set of challenges for training models. We encourage you to learn more about the steps we are taking to address bias and other content concerns on our About page at Source.Plus.
As we have been talking with researchers about the gaps in the commons, it has been exciting to see how Source.Plus makes it possible to explore this material in an unprecedented way, fueling discovery and inspiration. Content that was previously fragmented across the web is now searchable in one place and available for broader meta-analyses. In addition to model trainers and fine-tuners, we encourage artists, educators, and academic researchers to explore, ideate, and enjoy the breadth of these cultural resources.