Reflections on the Hamburg Court’s Ruling on LAION
The Importance of Time-of-Training Compliance in AI Data Governance
The case of Kneschke v. LAION before the Hamburg Regional Court is a significant marker in the interpretation of EU copyright law as it applies to AI development. Photographer Robert Kneschke sued LAION, a non-profit organization that created and publicly published a massive dataset of image links and metadata for AI training, alleging copyright infringement. The ruling offers one of the first judicial assessments of the legality of creating large datasets for AI training under EU copyright exceptions.
Understanding the Ruling
The court dismissed the infringement allegations against LAION, citing Section 60d of the German Copyright Act (UrhG), which implements Article 3 of the EU's Copyright in the Digital Single Market (CDSM) Directive (Directive (EU) 2019/790). This provision allows text and data mining (TDM) for scientific research purposes by research organizations, even without explicit consent from rights holders.
The court stated:
"The creation of the dataset... can indeed be regarded as scientific research..."
“Die Erstellung des Datensatzes... kann durchaus als wissenschaftliche Forschung... angesehen werden.”
Furthermore, the court emphasized that LAION operates as a non-commercial entity aimed at advancing scientific knowledge:
"The defendant also does not pursue commercial purposes... this is evident from the fact that the defendant makes this publicly available free of charge."
“Der Beklagte verfolgt auch nicht kommerzielle Zwecke... ergibt sich daraus, dass der Beklagte diesen... kostenfrei öffentlich zur Verfügung stellt.”
This ruling suggests that dataset aggregators collecting data for scientific research purposes can fall under the exception provided by Article 3, making their activities legally permissible.
At the same time, the ruling reinforces a critical point that Spawning has long advocated: the need for data governance mechanisms that are enforced at the time an AI model is trained. While the court's decision affirms the importance of scientific research exceptions, it also highlights the limitations of relying solely on crawling-stage restrictions for copyright enforcement. Crawling-stage TDM restrictions, such as robots.txt files or website terms of service, are applied when data is initially collected from the web, but they do not account for how that data might be used in the future. The ruling emphasizes that the purpose and legality of data use must be evaluated separately at each stage: a use may satisfy the legal criteria at one stage and fail them at another. LAION's dataset aggregation was exempted under Article 3 for research purposes, but that exemption does not necessarily speak to how the LAION datasets are used later.
Here is the court’s recognition of this distinction:
"In the opinion of the Chamber, this argumentation [plaintiff] does not distinguish strictly enough between:
firstly, the creation of a dataset that can also be used for AI training (which is the sole subject of dispute here),
secondly, the subsequent training of the artificial neural network with this dataset, and
thirdly, the subsequent use of the trained AI for the purpose of creating new image content."
"Diese Argumentation [Kläger] unterscheidet nach Auffassung der Kammer nicht streng genug zwischen
zum einen der (hier allein streitgegenständlichen) Erstellung eines ‒ auch ‒ für KI-Training nutzbaren Datensatzes,
zum anderem dem nachfolgenden Training des künstlichen neuronalen Netzes mit diesem Datensatz und
zum dritten der nachfolgenden Nutzung der trainierten KI zum Zwecke der Erstellung neuer Bildinhalte."
Data may be initially collected and shared in ways that meet the research requirements of the TDM exceptions, making that distribution of the dataset permissible. However, the subsequent training and use of AI models based on the dataset could still raise infringement concerns.
The AI Act and LAION: Aligning Copyright Compliance with EU Regulations
The court's distinction between dataset creation and model training aligns with the AI Act's emphasis on compliance at the model level, as evidenced in Article 53(1)(c) of the Act. This provision requires providers of general-purpose AI models to “put in place a policy to comply with Union copyright law, and in particular to identify and comply with... a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790,” placing the onus on model providers rather than dataset creators. While the court ruling primarily interpreted the CDSM Directive for dataset creation in scientific research, its emphasis on the training stage indirectly supports the AI Act's focus on model-level compliance. This alignment suggests that future interpretations of AI-related copyright issues may increasingly focus on the training stage, in line with both the court's reasoning and the AI Act's requirements.
While the court did not explicitly address Article 4 enforcement or suggest time-of-training checks, its emphasis on the distinct stages of AI development could inform future interpretations of how to implement the “policy to comply with Union copyright law” mandated by the AI Act. This alignment between judicial interpretation and legislative intent strengthens the case for adopting time-of-training mechanisms as a standardized compliance practice.
From Crawling to Training: A New Focus for Copyright Enforcement
The LAION ruling implicitly shifts the focus of copyright enforcement from the dataset aggregation stage to the point of model training. This shift is crucial because at the crawling stage it is often premature to determine the ultimate purpose of a dataset. Only at the training stage can we ascertain whether the actual use of the data falls under exceptions like those in Article 3 of the CDSM Directive or instead requires compliance with the opt-out regime of its Article 4.
The court acknowledged this uncertainty:
"The specific application possibilities in a rapidly developing technology like AI are not definitively foreseeable at the time of creating the training dataset..."
“Die konkreten Anwendungsmöglichkeiten bei einer sich rasant entwickelnden Technologie wie der KI sind zum Zeitpunkt der Erstellung des Trainingsdatensatzes... nicht abschließend absehbar...”
Moreover, rights holders' preferences may change over time (after a new licensing deal, for example), and checking rights reservations at the time of training ensures that the most current permissions are respected.
The Anatomy of AI Training: A Case for Time-of-Training Checks
Traditionally, copyright checks (if any) have been performed at the data collection stage, often using robots.txt as a coarse proxy for rights holder preferences. However, this approach has severe limitations, as highlighted by the LAION ruling. To appreciate the practicality of time-of-training checks, it's helpful to understand the typical training pipeline:
Data Collection: Web crawlers gather data from various online sources.
Data Preprocessing: The collected data is cleaned, formatted, and organized into a dataset.
Model Training: The preprocessed data is downloaded, sometimes months or years later, and fed into the AI model for training.
Model Evaluation and Iteration: The trained model is tested and refined.
Implementing checks at the model training stage is not only feasible but, Spawning would argue, also more effective. Modern AI training infrastructures are highly modular and can stack additional preprocessing steps without significant overhead. A time-of-training check can be implemented as an additional preprocessing step, verifying the usage rights of each unit of data before it is fed into the model, as in the sketch below.
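To make this concrete, here is a minimal sketch of such a preprocessing step in Python. Everything in it is a hypothetical placeholder: `OPTED_OUT_URLS` stands in for whatever live rights-reservation source a team actually queries. The point is structural: the filter sits between the stored dataset and the model, so it runs against the permissions in force at training time rather than those captured at crawl time.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for a live opt-out registry. In practice this
# would be fetched from a rights-reservation service when training
# starts, so it reflects the most current preferences.
OPTED_OUT_URLS = {
    "https://example.com/photo-123.jpg",
}

def check_training_allowed(url: str) -> bool:
    """Return True if no do-not-train reservation is recorded for url."""
    return url not in OPTED_OUT_URLS

def filter_opted_out(records: Iterable[dict]) -> Iterator[dict]:
    """Preprocessing step: drop any record whose source URL carries a
    do-not-train reservation at the moment training begins."""
    for record in records:
        if check_training_allowed(record["url"]):
            yield record

# The filter slots in front of the training loop without modifying the
# stored dataset itself.
dataset = [
    {"url": "https://example.com/photo-123.jpg", "caption": "opted out"},
    {"url": "https://example.org/photo-456.jpg", "caption": "allowed"},
]
for record in filter_opted_out(dataset):
    print(record["url"])  # prints only the allowed record
```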
This approach allows for real-time, up-to-date compliance with rights holder preferences, addressing the court's concern about the evolving nature of AI applications. It also provides a clear point of compliance for AI developers, aligning with both the CDSM and the AI Act.
Spawning's Do Not Train Tool Suite and DataDiligence package have always been developed with time-of-training checks in mind. The open-source DataDiligence package checks rights reservations at the time of training, an approach that matches the technical realities of AI development: as the LAION ruling re-emphasizes, the purpose and implications of data use often become clear only at the training stage.
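As a minimal sketch, a time-of-training check with DataDiligence can look like the following. The `is_allowed` call mirrors the pattern in the package's quickstart, but treat the exact signature as illustrative and consult the current documentation:

```python
# pip install datadiligence
import datadiligence as dd

# Links pulled from a previously crawled dataset (for example, a
# LAION-style index of image URLs), checked immediately before they
# are fed into the training loop rather than at crawl time.
urls = [
    "https://example.com/images/001.jpg",
    "https://example.com/images/002.jpg",
]

# Illustrative call: returns, for each URL, whether it is still
# cleared for training at this moment, so opt-outs registered after
# the original crawl are respected retroactively.
allowed = dd.is_allowed(urls=urls)

training_urls = [url for url, ok in zip(urls, allowed) if ok]
```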
Why robots.txt Falls Short: Lessons from LAION
The LAION ruling necessitates a reevaluation of the industry's current tools and methods, particularly the reliance on robots.txt and terms of service, in favor of time-of-training checks as the primary mechanism for expressing and enforcing copyright preferences.
Developed in the 1990s for web crawling, robots.txt is ill-equipped to handle the emerging use case of expressing AI training permissions. Crucially, the LAION ruling underscores a fundamental disconnect between the time when robots.txt is checked (during crawling) and the time when permissions are actually needed (at training). This temporal gap means that if a rights holder updates their robots.txt file to express new preferences about AI training, those changes will not be reflected in datasets that have already been crawled. Given the ruling's clear distinction between the dataset aggregation stage and the model training stage, a mechanism checked only at crawl time is inadequate for enforcing Article 4 compliance.
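The temporal gap is easy to see in code. The sketch below uses Python's standard-library robots.txt parser with placeholder URLs; the crawler asks its question exactly once, and the answer is effectively frozen into the dataset:

```python
from urllib import robotparser

# Crawl time: the crawler reads robots.txt once and records its answer.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

crawl_ok = rp.can_fetch("SomeCrawlerBot", "https://example.com/photo.jpg")

# That boolean is baked into the crawled dataset. If the site owner
# later edits robots.txt to opt out of AI training, the already-crawled
# copy never sees the change, because nothing in a crawl-only pipeline
# re-asks the question at training time.
```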
Furthermore, attempts to amend robots.txt to handle AI training permissions risk overcomplicating an already limited protocol designed for a completely different purpose.
As the Data Provenance Initiative recently illustrated in their research publication “Consent in Crisis: The Rapid Decline of the AI Data Commons,” there has been a rapid proliferation of restrictions on web crawlers associated with AI development. In just one year (2023–2024), more than 25 percent of tokens from the most critical domains, and more than 5 percent of tokens across the AI-training corpora studied, became restricted by robots.txt. This troubling trend is leading to a more closed and less competitive environment, potentially hindering the development of public models and limiting opportunities for smaller organizations. Moreover, the research highlights significant inconsistencies and inefficiencies in the current use of robots.txt for expressing data usage preferences, including errors, omissions, and contradictions with terms of service. By implementing time-of-training checks, we can protect and preserve rights holders’ interests without breaking an open internet, stifling innovation, or over-complicating the opt-out process.
ai.txt: A Clean Slate for Community-Driven Data Governance
In May 2023, Spawning proposed and implemented an early version of ai.txt, a protocol designed to address the limitations of robots.txt. Unlike robots.txt, which is checked at crawl time and cannot account for future data uses, ai.txt is built specifically for verifying rights reservations at the time of AI model training, aligning with the court's delineation between the dataset creation and model training stages. Spawning plans to expand ai.txt to offer a more granular and flexible approach to expressing permissions, allowing rights holders to specify different levels of consent for various AI applications, from text and data mining to generative AI training. Moreover, ai.txt can be applied retroactively to existing datasets, addressing the court's concern about the evolving nature of AI technologies and ensuring that the most current rights holder preferences are respected, even for data collected under previous permissions or exceptions.
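For illustration, a granular ai.txt might look like the following. The directives shown are a sketch of the direction described above, not the finalized syntax:

```
# ai.txt - illustrative sketch, not a finalized specification.
# Because reservations are checked at training time rather than crawl
# time, edits here also cover data that was crawled earlier.

User-Agent: *

# Permit text and data mining on text content
Allow: *.txt

# Reserve rights over images for generative AI training
Disallow: *.jpg
Disallow: *.png
```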
However, we want to emphasize that ai.txt is not a Spawning-owned solution, nor is it complete in its current form. Rather, we envision ai.txt as an ecosystem-wide, community-governed standard that requires input and refinement from all relevant stakeholders. Many companies are trying to solve the problem for their own organizations, but this proliferation of bespoke requirements and processes tends to complicate the problem further.
We are collaborating with Open Future on its efforts to define and categorize the “vocabulary” of TDM in a machine-readable syntax, participating in multi-stakeholder working groups and modeling the kind of collaborative approach needed. We encourage more initiatives of this nature and invite all stakeholders, including rights holders, AI developers, policymakers, and researchers, to contribute to the definition and refinement of this syntax and schema.
By working together to clarify definitions, consolidate them into a single machine-readable file, and implement checks at the time of training, we believe we can assuage some of the concerns of rights holders that have emerged from the recent ruling. Simultaneously, this approach can give model trainers greater confidence that they are not using data that carries legal risks, something that robots.txt and crawling-stage filtering do not ensure.
Spawning sees the LAION ruling as a clarifying development that correctly identifies where enforcement should happen: at the time of training. The ruling is not an endpoint, but a starting point for this crucial work. Our API-based checking system, DataDiligence, allows for real-time verification of permissions at the point of model training, ensuring that the most up-to-date rights holder preferences are respected. This approach enables the retroactive application of do-not-train assertions to existing datasets, providing a path forward for responsibly using historical data, a consequential consideration given that the ruling implicitly shifts the site of copyright enforcement away from dataset aggregation.
We remain committed to fostering an open dialogue and collaborative development process for ai.txt (and related protocols read by the Spawning API) for time-of-training checks. By building on the collaborative spirit that has driven web standards in the past, we can create a more equitable, safe, and responsible methodology for training.
For more information on Spawning's initiatives and how to get involved in the development of ai.txt, please visit spawning.ai or contact us directly.