How to train your LLaMA? – Copyright issues related to the training of generative AI models
Generative Artificial Intelligence (AI) models have been making waves around the world and have even been termed the steam engine of the fourth industrial revolution. These models are data-hungry, requiring vast amounts of high-quality data to generate quality outputs. As the race for generative AI heats up among developers, obtaining permission or a licence from every author may not be feasible due to operational, financial and time constraints. This has created a tussle between the developers of AI models and authors. The developers seek refuge under fair use, arguing that works created by AI are sufficiently transformative because, like human authors, AI models create from accumulated knowledge and creativity. The developers also argue that copyright protects the expression of ideas, not the ideas or factual information themselves. The authors, on the other hand, view this as a threat to their livelihood and are aggrieved by the use of their works to train AI models without their consent. They are further aggrieved by the possibility of hallucination diluting their brands, for instance where a model falsely attributes fabricated content to them.
Recent Case Developments
In the current competitive space, some developers relied on shadow libraries, which led to a flurry of suits in the US. For example, Meta acknowledged[1] using the ‘Books3’ section of the open-source language-modelling dataset ‘The Pile’ to train its LLaMA model. The Pile sourced some of its content from pirated shadow libraries. As a result, several authors, including comedian Sarah Silverman, filed lawsuits against Meta[2] and OpenAI[3]. In the suit against Meta, the authors argued that both the outputs and the AI model itself were derivative works and therefore infringed their copyright. The court refused to accept this contention, citing a 1984 ruling[4] to hold that establishing infringement requires proof that the outputs “incorporate[,] in some form[,] a portion of” the plaintiffs’ books.
In a bid to extend ChatGPT to news reporting, OpenAI signed agreements with the Associated Press and Axel Springer and was also in negotiations with the New York Times. Those negotiations broke down, ultimately resulting in the NYT filing a lawsuit against OpenAI and Microsoft before a New York court[5], alleging that ChatGPT and Copilot were built by copying and using millions of newspaper articles, investigations, opinions, reviews and guides. The NYT’s lawsuit is distinctive in that it alleges not only that the output of the AI models “closely summarizes” NYT’s copyrighted content and “mimics its expressive style” but also that the models copy verbatim, and it includes “scores of examples” in the complaint. The focus of the NYT’s suit is thus on the output produced by the AI models.
In another interesting case, a group of visual artists filed a suit before a California court[6], relying on a comprehensive set of grounds, including direct copyright infringement, vicarious copyright infringement, violation of publicity rights, unfair competition and breach of contract. While the court dismissed the other grounds on the facts of the case, it kept the ground of direct copyright infringement alive. This was because the developers had downloaded copyrighted images without authorisation and used them to train Stable Diffusion; crucially, the training allegedly caused those “images to be stored at and incorporated into Stable Diffusion as compressed copies”.
Fair Use
The fair use doctrine can shield developers from infringement claims. In the US context, if a use is sufficiently transformative in its purpose and does not reproduce substantial or verbatim portions of the copyrighted text, it may find refuge under fair use. A landmark precedent is the Google Books case[7], in which the US Supreme Court declined to review the Second Circuit’s finding. Google Books created a database of digitised copies of copyright-protected works, provided search functionality and displayed short snippets of the books to its users, a practice the court held to be protected fair use. If the jurisprudence of the Google Books ruling were applied, developers of AI models would likely not be held liable merely for storing copyrighted works at the backend.
What does the future hold?
As generative AI models continue to be trained on increasingly vast datasets, their outputs will regurgitate less and bear progressively less similarity to the original works. It will therefore become more challenging for authors and artists to prove direct copyright infringement based on the outputs of these systems. Nevertheless, authors may still allege that developers used pirated datasets and created copies of works on the developers’ or data providers’ systems, making the developers directly or vicariously liable, respectively. Proving such allegations may, however, turn out to be difficult: information about the datasets used to train AI models is becoming increasingly opaque, creating a barrier to initiating a lawsuit in the first place.
Interestingly, though, IP laws vary significantly across jurisdictions. In a major development, Recital 60i of the EU’s proposed AI Act creates a new obligation for providers of general-purpose AI models to obtain the authorisation of the concerned rightsholders before using their works to train an AI model. Further, Recital 60k mandates that developers publicly disclose “a detailed summary of the content used for training” the AI models, and Recital 60j extends the proposed law to AI models trained outside the EU.
While authors welcomed these proposed legislative changes, developers raised concerns. They worry that stringent legal controls over the use of scraped information, or compelling developers to negotiate licensing fees for training data, would deter technological progress, cause significant financial losses and harm the overall digital ecosystem. They also believe that mandating disclosure of trade secrets, such as training methods and sources of training data, would erode their competitive advantage and stifle innovation. Another possible outcome could mirror Meta’s removal of Canadian news from its platform in reaction to the Online News Act: developers might decide to focus only on content from a few favourable jurisdictions, which would be troublesome for the diversity of the knowledge economy and for societal progress. On the other hand, if no mechanisms are established to support original creators, the result may be a chilling effect on already struggling content creators and a decline in creativity.
I believe the way forward is to encourage, rather than compel, licensing mechanisms in which fees are based on metrics such as the number of times an AI model uses the copyrighted content in its training and/or output, depending upon feasibility; a purely illustrative sketch of such a metric follows below. Deeper collaboration between original creators and developers would thus play a crucial role in balancing creativity, originality and technical innovation. One example of such collaboration would be original creators working alongside developers in joint partnerships, playing an active role in developing and improving AI models.
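To make the proposed metric concrete, here is a minimal, purely illustrative sketch (in Python) of how a usage-metered licence fee might be computed. The record structure, function, rates and counts are all hypothetical assumptions for illustration; they are not drawn from any of the cases, laws or licensing deals discussed above.

```python
# Hypothetical sketch of a usage-metered licensing fee, as suggested above.
# All names, rates and counts are illustrative assumptions, not an existing scheme.

from dataclasses import dataclass

@dataclass
class UsageRecord:
    work_id: str        # identifier of the licensed work (assumed)
    training_uses: int  # times the work was sampled during training (assumed metric)
    output_uses: int    # times the work was substantially reproduced in outputs (assumed metric)

def licence_fee(record: UsageRecord,
                training_rate: float = 0.001,      # fee per training use (assumed)
                output_rate: float = 0.05) -> float:  # fee per output use (assumed)
    """Compute a fee proportional to how often the work was actually used."""
    return record.training_uses * training_rate + record.output_uses * output_rate

# Example: a work sampled 10,000 times in training and reproduced 3 times in outputs.
print(licence_fee(UsageRecord("book-42", training_uses=10_000, output_uses=3)))  # -> 10.15
```

The point of the sketch is simply that fees scale with measured use rather than being a flat, compelled charge, which is what would distinguish a voluntary, metered licensing mechanism from the blanket obligations developers object to.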
[1] Hugo Touvron et al., LLaMA: Open and Efficient Foundation Language Models (arXiv, 2023) arXiv:2302.13971 [cs].
[2] Kadrey et al. v. Meta Platforms, Inc., 23-cv-03417-VC (US District Court, N.D. California).
[3] Silverman et al. v. OpenAI, Inc., 3:23-cv-03416 (US District Court, N.D. California).
[4] Litchfield v. Spielberg, 736 F.2d 1352, 1357 (US Court of Appeals, Ninth Circuit; July 6, 1984).
[5] The New York Times Company v. Microsoft Corp. et al., 23-cv-11195 (US District Court, S.D. New York).
[6] Andersen et al. v. Stability AI Ltd. et al., 23-cv-00201-WHO (US District Court, N.D. California).
[7] Authors Guild v. Google, Inc., 804 F.3d 202 (US Court of Appeals, 2nd Circuit; 2015).