Comedian Sarah Silverman joins authors in suing OpenAI and Meta over AI training

Lawsuits allege that the companies used pirated copyrighted content to train their large language models
Alan Martin, 11 July 2023

The comedian Sarah Silverman has joined authors Christopher Golden and Richard Kadrey in dual lawsuits against OpenAI over its popular ChatGPT bot and Meta over its leaked LLaMA language model.

Both suits allege that the companies’ respective artificial intelligence has been trained on the authors’ copyright-protected works without their consent. A website supporting the action describes ChatGPT and LLaMA as “industrial-strength plagiarists that violate the rights of book authors”.

To develop its knowledge, artificial intelligence such as ChatGPT is trained on huge amounts of data taken from the internet. The lawsuits allege that the bots’ intricate knowledge of the authors’ works demonstrates that they were trained on copyrighted material.

The OpenAI lawsuit contains evidence of ChatGPT generating “very accurate summaries” of all three of the authors’ works: Silverman’s The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim.

Despite ChatGPT getting “some details wrong”, the complaint states that this proves that the AI “retains knowledge of particular works in the training dataset and is able to output similar textual content”.

“At no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works,” the complaint adds.

As to where this data has come from, the OpenAI complaint notes that while the “Books1” dataset appears to be roughly the size of Project Gutenberg (a repository of copyright-free books), the “Books2” one is so large that it can only have come from “shadow libraries”. These are repositories of pirated books.

“Tellingly, OpenAI has never revealed what books are part of the Books1 and Books2 datasets,” the complaint reads, before showing its working.

“The OpenAI Books2 dataset can be estimated to contain about 294,000 titles,” it continues. “The only ‘internet-based books corpora’ that have ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik.”

The plaintiffs in the two cases are requesting damages and injunctive relief — the latter of which could fundamentally alter the way that LLaMA and ChatGPT function.

“It’s a great pleasure to stand up on behalf of authors and continue the vital conversation about how AI will coexist with human culture and creativity,” conclude Joseph Saveri and Matthew Butterick in a post on the website supporting the action.

The Evening Standard has contacted OpenAI and Meta for comment.
