Nothing Transformative About OpenAI’s Copyright Abuses, Says New York Times Lawsuit

News publisher The New York Times said it owns over 3 million registered, copyrighted works.

On Wednesday (Dec. 27), the NYT sued Microsoft and OpenAI’s family of operating subsidiaries, alleging that they violated the bulk of the protections those copyrights provide.

“Through Microsoft’s Bing Chat (recently rebranded as ‘Copilot’) and OpenAI’s ChatGPT, defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the NYT wrote in its complaint, adding that “the law does not permit the kind of systematic and competitive infringement that defendants have committed.”

“OpenAI quickly became a multibillion-dollar for-profit business built in large part on the unlicensed exploitation of copyrighted works belonging to The Times and others,” the NYT added.

More broadly, sophisticated artificial intelligence systems have upended how existing intellectual property and copyright laws are applied as they have scaled across the marketplace with buzzy, consumer-facing products meant to capture market share and share of mind.

“I think [the lawsuit is] going to put a shot across the bow of all platforms on how they’ve trained their data, but also on how they flag data that comes out and package data in such a way that they can compensate the organizations behind the training data,” Shaunt Sarkissian, founder and CEO at AI-ID, an AI tracking, authentication, source validation, and output data management/control platform, told PYMNTS.

“The era of the free ride is over,” he added.

Reached by PYMNTS, an OpenAI spokesperson said the firm respects the rights of content creators and owners and is “committed to working with them to ensure they benefit from AI technology and new revenue models.”

“Our ongoing conversations with The New York Times have been productive and moving forward constructively, so we are surprised and disappointed with this development,” the spokesperson said in an emailed statement. “We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”

Microsoft did not immediately reply to PYMNTS’ request for comment.

Generative AI and large language models such as ChatGPT gain their wisdom by scraping the web. They use automated software to download countless web pages and natural language processing to extract information from those pages, which were built, often at great expense, by others.

In some ways, the problem at hand is as old as the internet itself.

‘Large-Scale’ Content Exploitation

The NYT noted in its lawsuit that it reached out to Microsoft and OpenAI in April to raise intellectual property concerns and explore an amicable resolution that would allow a mutually beneficial value exchange. The complaint highlighted the NYT’s history of working productively with large technology platforms to permit the use of its content in new digital products, including the news products developed by Google, Meta and Apple.

But nothing came of those entreaties.

“As part of training the GPT models, Microsoft and OpenAI collaborated to develop a complex, bespoke supercomputing system to house and reproduce copies of the training dataset, including copies of The Times-owned content. Millions of Times works were copied and ingested — multiple times — for the purpose of ‘training’ defendants’ GPT models,” the NYT lawsuit stated.

While Microsoft and OpenAI have in the past insisted that their conduct is protected as “fair use,” the NYT complaint rejected the notion that the “use of copyrighted content to train GenAI models serves a new ‘transformative’ purpose.”

“Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use,” the complaint stated.

An analysis by The Washington Post in April of just one dataset used for training AI found that tech companies have scraped content spanning nearly the entire 30-year history of the internet to add to the billions, even trillions, of words of text their models are trained on.

The complaint pointed specifically to the Common Crawl dataset, which was the most highly weighted training dataset for OpenAI’s GPT-3, noting that the domain www.nytimes.com is the most highly represented proprietary source in the set.

Other news publications whose content featured prominently in the dataset include The Guardian, The Los Angeles Times, Forbes and the Huffington Post.

The Future of Publishing Is Inextricably Intertwined With AI

The suit from the NYT comes amid several other court cases involving AI. For example, the summer saw at least two lawsuits from groups of writers against OpenAI, accusing the company of training its AI with copyrighted works without their permission, and of using illegal copies of their books pulled from the internet.

Some publishers, including the Associated Press and Axel Springer, have reached commercial agreements to license their content to OpenAI, something the NYT tried to do without success.

The U.S. Copyright Office has launched an initiative to study the use of copyrighted materials in AI training, indicating that legislative or regulatory steps may be necessary in the near term to address the practice.

“What this case will likely do is create a benchmark of what is the economic threshold, or what are reasonable royalties, for fair use of content,” Sarkissian said. “Everyone’s going to use The New York Times as a proxy and see how it goes.”

The lawsuit requests a trial by jury but does not make a specific monetary demand. The complaint does, however, emphasize that Microsoft and OpenAI should be held responsible for “billions of dollars in statutory and actual damages.”

“If The Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill,” the lawsuit stated.
