GENERATIVE AI MODELS: POTENTIAL MASSIVE COPYRIGHT INFRINGEMENT

-- By PedroLondono - 28 Feb 2024


Between September and December 2023, OpenAI and Microsoft were hit with three significant lawsuits alleging copyright infringement in the training of the Large Language Models (LLMs) behind their generative artificial intelligence (gen AI) products. These lawsuits are likely to challenge the viability of generative AI models as we now know them. This essay explores the arguments raised in the lawsuits and the different potential outcomes, taking into account the possibility of sustaining these language models through the Creative Commons licensing system implemented by the Wikimedia Foundation.

GENERATIVE ARTIFICIAL INTELLIGENCE AND THE COPYRIGHT ISSUE


In June 2018, OpenAI first introduced its generative AI LLM called the Generative Pre-trained Transformer (GPT). Its premise is not complex: users write prompts requesting something specific, and the language model delivers the requested output in seconds. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years, and that what makes them different nowadays is that there is much more data available to “feed” them, given that individuals have been giving away their own information freely, and sometimes unconsciously, over the past decades (Moglen and Choudhary).
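To make that premise concrete, the following minimal sketch shows a single prompt-and-response round trip. It assumes the current openai Python SDK (version 1.x) and an API key in the environment; the model name is merely illustrative.

<verbatim>
# Minimal prompt-and-response sketch (assumes openai>=1.0 and OPENAI_API_KEY set).
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "user", "content": "Summarize the fair use doctrine in two sentences."}
    ],
)

print(response.choices[0].message.content)  # the generated output
</verbatim>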

One key element of these generative LLMs is their training process. Similar to the way human beings learn throughout their lives, LLMs are trained on large bodies of data and information, which the system analyzes, processes and uses in what is called the machine learning process. Based on the information disclosed by OpenAI, its LLMs are trained on datasets composed of publicly available texts and information from the internet, licensed content from third parties, and user-generated information. However, based on the assertions made by the Plaintiffs (the Authors Guild and The New York Times) in their claims, it is very likely that these datasets were also fed with copyrighted works, without a license from their owners.
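By way of illustration only, the toy script below counts which word tends to follow which in a tiny, invented corpus and then “generates” text from those counts. Real LLMs use neural networks trained on billions of tokenized documents rather than simple word counts, but the sketch conveys why the provenance of the training texts matters: whatever the model outputs is a statistical function of the works it was fed.

<verbatim>
# Toy illustration of learning from a text corpus (NOT how GPT is actually trained).
from collections import Counter, defaultdict

# Invented stand-in for the "publicly available texts" described above.
corpus = [
    "the court found the use transformative",
    "the court found the copying unauthorized",
    "the plaintiffs allege unauthorized copying",
]

# For every word, count which word follows it (a simple bigram model).
next_word_counts = defaultdict(Counter)
for document in corpus:
    words = document.split()
    for current_word, following_word in zip(words, words[1:]):
        next_word_counts[current_word][following_word] += 1

# "Generate" text by always picking the most frequent continuation.
word = "the"
output = [word]
for _ in range(5):
    if not next_word_counts[word]:
        break
    word = next_word_counts[word].most_common(1)[0][0]
    output.append(word)

print(" ".join(output))  # -> "the court found the court found"
</verbatim>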

In that sense, the plaintiffs argue that the unauthorized copying of their works to “feed” these LLMs amounts to a massive reproduction of copyrighted works, which would violate the copyright owners’ exclusive rights under Section 106 of the US Copyright Act. Training a machine, per se, may not amount to copyright infringement, just as a human being does not infringe an IP right merely by reading books, poems and encyclopedias or by looking at paintings and sculptures and learning from what they see. Nonetheless, the plaintiffs allege that during the training process there was a physical act of copying the copyrighted works into a software or system, that is, a reproduction of the original made without the authorization of the right holder. As in the Google Books case, Authors Guild v. Google, such a reproduction would violate Section 106 of the Copyright Act.

FAIR USE AND CREATIVE COMMONS - POSSIBLE OUTCOMES


US courts have often applied the fair use doctrine to excuse copyright infringements and immunize defendants from liability under specific circumstances (especially when it comes to big and powerful tech companies, as happened in Authors Guild v. Google (supra) regarding the Google Books project). Codified in §107 of the Copyright Act, the fair use doctrine identifies circumstances under which the violation of an exclusive copyright will not be considered an infringement, enabling certain uses of copyrighted works for specific purposes. As established in the statute, a fair use defense is determined by weighing four factors: (i) the purpose and character of the use, (ii) the nature of the copyrighted work, (iii) the amount and substantiality of what the infringer used from the copyrighted work, and (iv) the effect of the unauthorized use on the potential market or target audience of the copyrighted work (Henderson et al.).

These four factors are non-exhaustive, and fair use is an equitable doctrine under which a judge may find the defense well grounded even if not all four factors weigh in the defendant’s favor (Balganesh et al.). Based on the Google Books precedent, it would not be surprising if the judges determined, under the first fair use factor and for policy reasons along the lines of “promoting technological development,” that the Defendants are not liable for copyright infringement (Lemley and Casey).

However, even if the courts do not find fair use and declare OpenAI and Microsoft’s infringement, the Creative Commons BY-SA (Attribution-ShareAlike) license used by the Wikimedia Foundation could serve these big tech companies and give them sufficient data to train their models. This licensing model allows anyone to use, share, and adapt Wikipedia content for any purpose, as long as they provide proper attribution to the original authors and release any derivative works under the same license (Creative Commons). This fosters collaboration, knowledge sharing, and the creation of derivative works while ensuring that the original creators receive credit for their contributions. Furthermore, this licensing scheme enables its users to access millions of different works and use them without having to pay any compulsory license or any remuneration to their authors.

Moreover, these licenses could also serve the defendants as a defense in these cases, if they did in fact train their LLMs on the millions of works available under the Creative Commons BY-SA license. However, it will be interesting to see how ChatGPT and other gen AI models comply with the “ShareAlike” portion of these licenses, given that the burden users bear when benefiting from these free licenses is that the source code of the models must be made available to the public freely, under the same terms as the Creative Commons BY-SA works themselves (Moglen, supra).
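As a purely hypothetical sketch of what the attribution (“BY”) half of that burden might look like in practice, a training pipeline could keep a provenance manifest for every Wikipedia article it ingests. The record format and field names below are invented for illustration, not an actual Wikimedia or OpenAI schema, and the ShareAlike (“SA”) obligation discussed above would remain a separate question.

<verbatim>
# Hypothetical provenance manifest for CC BY-SA training documents.
# Field names are illustrative, not an actual Wikimedia or OpenAI schema.
import json

def make_attribution_record(title: str, source_url: str) -> dict:
    return {
        "title": title,
        "source_url": source_url,  # attribution should point back to the article and its author list
        "license": "CC BY-SA 4.0",
        "license_url": "https://creativecommons.org/licenses/by-sa/4.0/",
    }

training_manifest = [
    make_attribution_record(
        "Copyright law of the United States",
        "https://en.wikipedia.org/wiki/Copyright_law_of_the_United_States",
    ),
]

# Such a manifest could be published alongside the dataset or model to document attribution.
print(json.dumps(training_manifest, indent=2))
</verbatim>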

