Artificial Intelligence (AI) has been a transformative force across industries, automating tasks and delivering insights beyond human capacity. However, as AI's capabilities grow, so do the complexities around its development and usage, especially in the realm of copyright law. In an increasingly digital world, understanding how AI intersects with intellectual property rights has become crucial.
A recent case spotlighting this intricate nexus involves prominent comedian Sarah Silverman and authors Richard Kadrey and Christopher Golden. They have filed lawsuits against Meta Platforms, formerly known as Facebook, and OpenAI, the organization behind the advanced language model GPT-3, alleging copyright infringement. The crux of their argument is that these tech entities used their copyrighted works without permission to train their AI language models.
So what is the lawsuit about?
The Joseph Saveri Law Firm, representing five plaintiffs (Mona Awad, Paul Tremblay, Christopher Golden, Richard Kadrey, and comedian Sarah Silverman), lodged complaints on June 28th and July 7th. These complaints allege Direct Copyright Infringement under 17 U.S.C. § 106 against OpenAI. The plaintiffs contend that OpenAI copied their books without authorization while training its AI language models.
Adding to this, the plaintiffs allege that the AI models themselves are infringing derivative works, created in violation of their exclusive rights under the Copyright Act. They also accuse OpenAI of Vicarious Copyright Infringement, arguing that OpenAI both controls and profits from the output of its AI language models.
They also assert a violation of the Digital Millennium Copyright Act, 17 U.S.C. § 1202(b). The plaintiffs claim that OpenAI intentionally removed copyright management information from their works, thereby enabling continued copyright infringement.
The complaint doesn't stop there. The plaintiffs contend that the organisation violated California Business and Professions Code Section 17200 by engaging in unlawful business practices for commercial gain. They also insist that OpenAI acted negligently under California common law, asserting that the organisation owed them a duty of care. Instead, the plaintiffs argue, OpenAI used their works to train its AI models, unjustly enriching itself through the unauthorised use of their works and depriving the plaintiffs of their rightful benefits.
The plaintiffs, who bring the case as a class action under Federal Rules of Civil Procedure 23(a), 23(b)(2), and 23(b)(3), are seeking relief that includes statutory damages, actual damages, restitution of profits, injunctive relief, and attorney’s fees, among other demands.
At this point, you may be asking yourself, why does this even matter?
Well, this lawsuit and other similar matters before the courts could influence everything: how AI develops, how we interpret copyright law in the digital age, who controls and benefits from collective intellectual property, and where we draw the line at the intersection of artificial intelligence and copyright law.
Whether infringement has in fact occurred can be examined through the two fundamental components of a generative Artificial Intelligence system: the data fed into the system and the data it produces as a result. In other words, training the model and using the model to generate content. Each component must then be examined in the context of current applicable law.
Let's first address the aspect of training an AI model.
Imagine this: you've got these massive language models, known as large language models or LLMs, and their job is to understand and predict language. Think of each one as a really enthusiastic student. They soak in all sorts of data we give them, learning patterns and adjusting their internal "neurons" so they can understand and respond to a wide variety of scenarios. It's a big classroom for them, and in the case of behemoths like ChatGPT, the chalkboard is the internet itself, a treasure trove of data known as a corpus.
So, here's the tricky part: these models are pulling data from a vast universe of internet sources and, let's be honest, a good chunk of that data might have some copyright strings attached. Some of these data sources might even explicitly wave the 'no-no' flag on using them this way. Now, this is where the plot thickens and we find ourselves in a legal pickle. The question then becomes: does the AI company have the green light to copy and use this content during the training process?
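To make the "enthusiastic student" picture a little more concrete, here is a minimal sketch in Python of a toy next-word predictor. It is only an illustration, not how ChatGPT or any real LLM is built: real models use neural networks with billions of parameters, whereas this stand-in simply counts which word tends to follow which. The three-sentence corpus and the function names are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follower_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair each word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, standing in for the internet-scale data
# that real models are trained on.
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(toy_corpus)
print(predict_next(model, "sat"))
print(predict_next(model, "zebra"))
```

Even in this tiny version, the "knowledge" the model ends up with is a set of statistics about word patterns, built by reading every sentence in the corpus at least once during training.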
Artificial intelligence serves as a dynamic instrument that propels the creative process forward. Mimicking the way an individual absorbs the complexities of language and imagery from visiting an art gallery or library, AI models ingest vast quantities of data. They discern the intricate connections between words, ideas, and both visual and textual characteristics. Through this, they assimilate the fundamental facts and structures constituting our communication systems.
This learning process furnishes them with a flexible reservoir of knowledge, which they subsequently utilise to generate novel and unseen content. These fresh compositions are not present in their training data and may indeed be unique, having never appeared elsewhere.
These models do not depend on or retain any specific work from their training data. Instead, they evolve by identifying recurrent patterns across billions of images and trillions of words of text. It is our contention that this process of model development represents a permissible and socially advantageous application of existing content, falling squarely within the boundaries of fair use.
But would they be liable for copyright infringement if they access illegal torrent websites to acquire the data? And if they do, can you prove it?
To understand that, let’s explore what the law says.
Firstly, originality is the cornerstone of copyright law. To qualify for copyright protection, a work must embody at least a modicum of creativity and be more than mere rote reproduction.
U.S. copyright law applies to any work as soon as it’s created and fixed in a tangible form, whether registered or not. It grants a set of exclusive rights to a work’s owner and protects the owner and work regarding issues of reproduction, distribution and adaptation. For actual protection of these rights in court, the creator must register the work with the U.S. Copyright Office, which requires forms specific to the type of material being copyrighted and can incur certain fees.
Copyrights are different from trademarks, which largely concern branding, and also from patents, which protect inventions. A work doesn’t need to be published or publicly available to be subject to copyright, but it must be expressed in a discrete, tangible form and must be original. Though there are exceptions, copyright protection usually lasts for 70 years past the death of the work’s creator, after which the work enters the public domain.
In a recent Senate Judiciary Committee hearing on AI and Copyright held on July 12th, Professor Matthew Sag of Emory University School of Law argued that training generative AI on copyrighted works can be considered fair use, as it falls under the category of non-expressive use. Courts have recognized similar non-expressive uses, such as reverse engineering, search engines, and plagiarism detection software, as fair use. The distinction between protectable original expression and unprotectable facts and ideas in copyright law is relevant here. Whether training large language models (LLMs) is a non-expressive use depends on the model's outputs. If an LLM is trained properly and operates with safeguards, its output will not resemble its inputs in a way that would lead to copyright infringement. Generative AI is not designed to copy original expression but rather learns from the training data like a student.
Sag also addresses the misconception that generative AI directly copies training data. Machine learning models are influenced by the data but do not literally copy it. The only copying that occurs is during the assembly and preprocessing of the training corpus. To ensure copyright compliance, companies should adopt best practices and minimise the risk of infringement. Professor Sag emphasizes the importance of these best practices in his written submission.
He believes that the current U.S. copyright system does not require a major overhaul for generative AI. Instead, any new legislation should focus on clarifying the application of existing fair use principles. He said that other jurisdictions, such as Israel, Singapore, South Korea, Japan, the United Kingdom, and the European Union, have already incorporated fair use or specific exemptions for text data mining into their copyright laws. It is important for copyright laws to encourage responsible development of generative AI. If laws become overly restrictive, corporations and researchers may move their technological advancements to countries with more favourable conditions, undermining the competitive advantage of the United States.
While numerous experts in the field maintain that AI-generated content isn't a simple reproduction but a transformative output, Silverman, alongside other creatives, asserts that ChatGPT can provide accurate summaries of their books when given relevant prompts. The lawsuit suggests that such capabilities wouldn't exist unless the AI model was trained on the copyrighted materials. However, considering that ChatGPT's training involves billions of internet texts, it probably encountered summaries, discussions, and references to these books in online articles, social media posts, and comments, and could have drawn on those to produce its summaries in response to prompts.
Under the guidelines of the Fair Use Doctrine in U.S. law, there are four factors that courts analyse when examining fair use defences. The first factor explored is the objective of the disputed use. Subsequently, they consider the nature of the copyrighted works. The third element is the quantity and substantiality of the content taken from the copyrighted work. Finally, they evaluate the impact of the disputed use on the market value or demand for the copyrighted work. Even though the factors related to the purpose and market effects are often given substantial consideration in fair use cases, a balanced analysis requires all four factors to be weighed collectively.
Presently, courts are wrestling with defining the boundaries of what is deemed a "derivative work" under intellectual property laws. Interpretations may differ across federal circuit courts depending on their jurisdiction. The outcomes of these cases are expected to hinge on the interpretation of the fair use doctrine. This doctrine allows the use of copyrighted work without the owner's permission for purposes such as critique (including parody), commentary, news reporting, education (including multiple copies for classroom use), scholarship, or research, and for using the copyrighted content in a transformative manner that was not initially intended.
In the Google Books case (Authors Guild v. Google), the court held that Google's unauthorised digital conversion of copyright-protected works, its creation of a search feature, and its display of excerpts from those works were non-infringing fair uses. The goal of the duplication was highly transformative, the public text display was confined, and the disclosures did not significantly substitute the market for the protected elements of the originals. Google's commercial orientation and profit motive did not merit a fair use denial.
Even though the decision in the Google Books case is specific to that particular scenario, it seems to support the idea that using copyrighted materials as training data for a generative AI system could be within the limits of fair use. Once the model is trained, it doesn't directly incorporate the corpus used for training, so individual works are not discretely stored in the model. This is somewhat akin to how information is stored in a human brain. When you use the model to create content, you're generating new, unique content that didn't exist before.
Generative AI is designed to leverage existing works as a foundation for crafting entirely new creations. As such, the reflection of original works within these novel AI-generated pieces may be so minimal or altered as to be virtually untraceable. However, this isn't an unrestricted pass for utilising copyrighted materials in all applications of generative AI; each use case still requires careful consideration and adherence to legal and ethical standards.
As emerging technologies raise novel copyright queries that weren't foreseen by legislative bodies, the courts often determine the best course of action based on the guiding principles outlined in the Constitution.
The Constitution empowers Congress to “promote the Progress of Science and useful Arts”, in essence, to cultivate and share knowledge for the collective benefit of society.
To achieve this, a careful equilibrium must be struck. On one hand, copyright holders possess valid rights to thwart unauthorised uses of their works that could stifle their motivation to create. On the other hand, the creators of groundbreaking technologies and subsequent innovators also have valid rights—they need a certain degree of flexibility to facilitate their own innovative endeavours.
So, what are the chances that Sarah Silverman and her fellow authors will emerge victorious in this AI face-off?
James Grimmelmann, a professor at Cornell Law School who has extensively studied the Google case and is closely monitoring AI advancements, remains sceptical about the authors' chances of success in their copyright infringement lawsuits. He agrees that AI developers have substantial legal precedents to rely on. However, he expresses some sympathy towards the notion that certain AI models may indeed infringe on copyrights. He points out a key distinction between AI and Google Books: some AI models could potentially generate infringing works, unlike the snippet view feature in Google Books, which was specifically designed to prevent output infringement. This distinction influences the fair use analysis, although there are still multiple factors that support the transformative use argument.
Grimmelmann also highlights the potential complication of AI models being trained using illegal copies obtained from pirate websites. He explains that under a conventional copyright analysis, if the output is not infringing and the internal process is transformative, it could be considered fair use. However, he notes that some courts may take into account the source of the copies, including the alleged "unsavoury origins," when conducting the fair use analysis.
Hence, the question of whether a generative AI system can utilise copyrighted input data in this case remains unresolved. While the generation of new works with minimal display of copyrighted content is a promising aspect, the potential for users to produce works that closely resemble protected works or act as market substitutes raises concerns. Ultimately, striking a balance between transformative innovation and safeguarding copyright interests is a crucial aspect that requires careful consideration and legal guidance in the evolving landscape of generative AI.
Ben Brooks, head of Public Policy at Stability AI, proposed interim solutions such as:
Voluntary opt-outs, so that creators can choose whether they want their work to be used for AI training.
Implementing features to help users identify AI content. Images generated through digital platforms can be digitally stamped with metadata and watermarks to verify that the content was generated with AI.
Helping tech platforms distinguish AI content before amplifying it online.
Establishing layers of mitigation to make it harder to do the wrong thing with AI and easier to do the right thing.
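As a rough illustration of what the first two proposals could look like in practice, here is a minimal Python sketch. It is not a description of how Stability AI or any other company actually implements opt-outs or content labelling. The function names, the record fields (`source`, `ai_generated`, `generator`, `sha256`) and the sample data are all hypothetical, and a plain metadata label like this one is far easier to strip than a real watermark.

```python
import hashlib


def filter_opted_out(documents, opted_out_sources):
    """Keep only documents whose source is not on the opt-out list."""
    return [doc for doc in documents if doc["source"] not in opted_out_sources]


def stamp_ai_metadata(content, generator_name):
    """Wrap generated content in a record that labels it as AI-generated."""
    return {
        "content": content,
        "ai_generated": True,
        "generator": generator_name,
        # Digest of the content, so later edits to the text can be detected.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }


def is_labelled_ai(record):
    """Check the AI label and that the content still matches its digest."""
    digest = hashlib.sha256(record["content"].encode("utf-8")).hexdigest()
    return record.get("ai_generated") is True and record.get("sha256") == digest


# Hypothetical training candidates and opt-out list.
documents = [
    {"source": "author-a.example", "text": "Chapter one..."},
    {"source": "author-b.example", "text": "Another chapter..."},
]
opted_out = {"author-b.example"}

training_set = filter_opted_out(documents, opted_out)
print([doc["source"] for doc in training_set])

record = stamp_ai_metadata("A generated caption", "hypothetical-model")
print(is_labelled_ai(record))
```

A platform receiving such a record could call `is_labelled_ai` before deciding how to display or amplify the content, which is the spirit of the third proposal.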
Ben Brooks further stated that, despite dramatically altering the financial dynamics of creation, smartphones did not undermine the art of photography, and similarly, the advent of word processors did not degrade the realm of literature. Indeed, just as technological innovations like smartphones, word processors, and email have transformed their respective fields without devaluing them, so too can Large Language Models (LLMs) like GPT-4 transform the landscape of creativity without diminishing its essence.
LLMs, with their capacity to generate human-like text, can serve as a tool to inspire, aid, and enhance human creativity rather than replace it.
LLMs encourage creativity by acting as collaborative partners, foster accessibility for all types of end users by democratising creativity, and help users enhance, refine and polish existing content.
In essence, the value and authenticity of creativity reside in the unique human experiences, emotions, and original ideas that we bring to the creative process. While LLMs can mimic human-like text generation, they do not possess genuine feelings, experiences, or consciousness. Therefore, instead of viewing AI and LLMs as threats to human creativity, we can embrace them as sophisticated tools that augment and enrich our creative abilities, fostering a synergy between human ingenuity and artificial intelligence.
Regardless of the outcome, the class action lawsuits against the tech companies that own the most popular AI models will set a precedent that will be relevant in the future.