Comedian Sarah Silverman joins authors in suing OpenAI and Meta over AI training

Lawsuits allege that the companies used pirated copyrighted content to train their large language models

Alan Martin, 11 July 2023

The comedian Sarah Silverman has joined authors Christopher Golden and Richard Kadrey in dual lawsuits against OpenAI over its popular ChatGPT bot and Meta over its leaked LLaMA language model.

Both suits allege that the companies’ respective artificial intelligence has been trained on the authors’ copyright-protected works without their consent. A website supporting the action describes ChatGPT and LLaMA as “industrial-strength plagiarists that violate the rights of book authors”.

To develop its knowledge, artificial intelligence such as ChatGPT is trained on huge amounts of data taken from the internet. The lawsuits allege that the bots’ intricate knowledge of the authors’ works demonstrates that they were trained on copyrighted material.

The OpenAI lawsuit contains evidence of ChatGPT generating “very accurate summaries” of all three of the authors’ works: Silverman’s The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim.

Although ChatGPT got “some details wrong”, the complaint states that the summaries prove that the AI “retains knowledge of particular works in the training dataset and is able to output similar textual content”.

Image: Sarah Silverman is an American stand-up comedian, actress, and writer (photo: supplied)

“At no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works,” the complaint adds.

As to where this data has come from, the OpenAI complaint notes that while the “Books1” dataset appears to be roughly the size of Project Gutenberg, a repository of copyright-free books, the “Books2” dataset is so large that it can only have come from “shadow libraries”, which are repositories of pirated books.

“Tellingly, OpenAI has never revealed what books are part of the Books1 and Books2 datasets,” the complaint reads, before showing its working.

“The OpenAI Books2 dataset can be estimated to contain about 294,000 titles,” it continues. “The only ‘internet-based books corpora’ that have ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik.”

The plaintiffs in the two cases are requesting damages and injunctive relief — the latter of which could fundamentally alter the way that LLaMA and ChatGPT function.

“It’s a great pleasure to stand up on behalf of authors and continue the vital conversation about how AI will coexist with human culture and creativity,” conclude Joseph Saveri and Matthew Butterick in a post on the website supporting the action.
