Concerns Arise Around OpenAI’s Use of Copyrighted Material for AI Training
Recent research has bolstered claims that OpenAI's models may have been trained on copyrighted content. The study, conducted by scientists from several prominent universities, suggests that some of the data these models memorized consists of protected works, raising significant legal and ethical questions.
Background on Legal Challenges
OpenAI is currently facing several lawsuits filed by a range of content creators, including authors and software developers. The suits accuse the company of using their intellectual property, from literary works to software code, to train its AI systems without securing proper permission. OpenAI defends its approach under the doctrine of fair use, but the plaintiffs dispute that the defense applies to training data.
Study Details and Findings
The study in question was co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford University. It introduces a novel technique aimed at identifying instances where AI models memorize training data.
Language models, like those used by OpenAI, function as predictive engines: they analyze vast datasets to learn underlying patterns. Their outputs typically do not replicate the training data verbatim, but some can closely resemble it. Research has shown, for instance, that image models can inadvertently reproduce specific visuals from their training sets, and language models have been observed replicating phrasing from news articles and literature.
Methodology of the Study
The researchers employed a method focused on “high-surprisal” words—terms that are statistically rare in a given context. For example, in the phrase “Jack and I sat perfectly still with the radar humming,” the word “radar” qualifies as high-surprisal when compared to more common words like “engine” or “radio.”
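The idea of "surprisal" can be made concrete: a word's surprisal is the negative log-probability a model assigns it in context, so rare continuations score high. The sketch below illustrates this with a toy, hand-made next-word distribution (the probabilities are invented for illustration and are not from the study or any real model):

```python
import math

# Toy next-word distribution for the context
# "Jack and I sat perfectly still with the ___ humming."
# These probabilities are illustrative assumptions, not model outputs.
next_word_probs = {
    "engine": 0.30,
    "radio": 0.25,
    "fridge": 0.10,
    "radar": 0.01,
}

def surprisal(word: str, probs: dict) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(probs[word])

# A rare word in context carries far more surprisal than common ones.
for word in ("engine", "radio", "radar"):
    print(f"{word}: {surprisal(word, next_word_probs):.2f} bits")
```

Under this toy distribution, "radar" scores roughly 6.6 bits versus about 1.7 bits for "engine", which is why it qualifies as high-surprisal.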
By masking these high-surprisal words in excerpts from well-known literary works and articles from the New York Times, the team tested several OpenAI models, including GPT-4 and GPT-3.5. If a model successfully guessed the masked words, it suggested that the model had “memorized” these excerpts during its training phase.
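The masking step can be sketched in a few lines. This is a simplified illustration of the general approach, not the researchers' actual code: the set of high-surprisal words and the scoring helper (`memorization_rate`) are hypothetical names introduced here for clarity.

```python
import re

def mask_high_surprisal(excerpt: str, rare_words: set) -> tuple:
    """Replace each high-surprisal word with [MASK] and record the answers."""
    answers = []

    def repl(match):
        word = match.group(0)
        if word.lower() in rare_words:
            answers.append(word)
            return "[MASK]"
        return word

    masked = re.sub(r"[A-Za-z]+", repl, excerpt)
    return masked, answers

def memorization_rate(guesses: list, answers: list) -> float:
    """Fraction of masked words the model filled in exactly."""
    hits = sum(g.lower() == a.lower() for g, a in zip(guesses, answers))
    return hits / len(answers) if answers else 0.0

excerpt = "Jack and I sat perfectly still with the radar humming"
masked, answers = mask_high_surprisal(excerpt, {"radar"})
print(masked)  # Jack and I sat perfectly still with the [MASK] humming

# A model's guesses for the masked slots would then be scored:
print(memorization_rate(["radar"], answers))  # 1.0
```

A high rate of exact guesses across many excerpts from a given source is what the study treats as evidence of memorization.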
Implications of the Study
Results indicated that GPT-4 showed signs of having memorized passages from popular fiction, particularly from BookMIA, a dataset containing samples of copyrighted ebooks. The model also recognized portions of New York Times articles, though at a considerably lower rate.
Co-author Abhilasha Ravichander, a doctoral student at the University of Washington, expressed that these findings contribute to the dialogue around the contentious nature of the data used for training models. Ravichander stated, “In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically. Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”
OpenAI’s Stance on Copyrighted Material
Over the years, OpenAI has advocated for more lenient regulations regarding the use of copyrighted content in training AI models. The company has established licensing agreements and offers mechanisms for copyright holders to exclude specific content from training datasets. However, it continues to lobby for the adoption of “fair use” provisions in legislation surrounding AI development.