Welcome to Popular Information, a newsletter dedicated to accountability journalism that is always written by human beings.

In television ads, AI tools appear to work like magic, generating useful information or smooth prose in response to a wide variety of prompts. In reality, these tools, known as Large Language Models (LLMs), are trained by AI companies on enormous quantities of text and images produced by humans. The AI companies have billions in funding, but in nearly all cases, the authors of those works are not compensated.

In August 2024, three authors (Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson) filed a class action lawsuit against Anthropic, the company behind the popular AI chatbot Claude. Anthropic was founded by former OpenAI staff members, a splinter group that believed a new company was necessary to refocus on OpenAI's purported original mission: creating ethical AI that benefits everyone. The company says it makes "decisions that maximize positive outcomes for humanity in the long run."

Books are particularly valuable for LLMs precisely because humans have taken such care to produce them. If you want to teach a computer the nuances of many different topics — and how to write clearly — there is no substitute for books. But when it came time to train Claude, Anthropic did not start by buying books. Instead, it "downloaded for free millions of copyrighted books in digital form from pirate sites on the internet," including books by the three authors who filed the lawsuit. Anthropic CEO Dario Amodei admitted there were "many places" where the company could have purchased the books, but it chose to steal them to avoid the "legal/practice/business slog." Over the course of a few years, Anthropic downloaded more than seven million books to train Claude.

In response to the lawsuit, Anthropic claimed that using these pirated works to train Claude was "fair use." This week, a federal judge, William Alsup, rejected Anthropic's effort to dismiss the case and found that stealing books from the internet is likely a copyright violation. A trial will be scheduled. If Anthropic loses, each infringed work could carry statutory damages of $750 or more, potentially exposing the company to billions of dollars in liability. Other AI companies that use stolen work to train their models — and most do — could also face significant exposure.

But Alsup ruled in favor of Anthropic on a related issue that may ultimately prove more important. While Anthropic originally downloaded pirated books, it later reconsidered that approach. The company began purchasing millions of books, tearing the pages from their bindings, and scanning them to create digital copies for training. Those digital copies were then duplicated many times over to train various AI models. Alsup found that as long as the original copy is purchased legally, books can be copied and memorized by AI models without the author's permission. In reaching this conclusion, Alsup anthropomorphized AI models, reasoning that what Anthropic is doing is no different from a person reading a book and then drawing on that information to write a new text.
Of course, LLMs are not people. They do not read or write. They are software products, created by corporations with billions in funding, that can store the exact contents of millions of books. Once trained on those books, the models can quickly produce exact replicas or similar works that could be used for the same purpose. Although Alsup claims otherwise, this could significantly degrade the value of the original work. For example, one of the authors who sued Anthropic, Graeber, published a book in 2018 called "The Breakthrough: Immunotherapy and the Race to Cure Cancer." The book documents the development of cancer immunotherapies that culminated in a Nobel Prize. Claude, in part because it was trained on Graeber's book, could create a narrative of the rise of cancer immunotherapies in the style of Graeber. Some Claude users could read that narrative instead of the book.

The ruling, should it survive a likely appeal, would be a major blow not just to book authors but to millions of other writers and creatives. The New York Times, for example, has sued OpenAI, alleging that its ChatGPT product uses the Times' copyrighted reporting without permission. If Alsup's ruling were applied to the facts of that case, OpenAI could train its models on the New York Times' reporting for the cost of a single subscription. Similarly, major music labels have sued two companies that "released AI programs that enable users to generate songs from text prompts." Could a subscription to Spotify be all that's required to avoid liability?

Alsup's ruling does have limits: it applies only to the training of AI models. Alsup is open to the possibility that the models' output could violate copyright law if it were too similar to the original work. He does not consider that issue with regard to Anthropic because the plaintiffs "do not allege… that any LLM outputs infringing upon their works ever reached users of the public-facing Claude service." The judge also notes that Anthropic created filters designed to ensure that "no exact copy, nor any substantial knock-off" would be delivered to users. That, of course, depends entirely on how you interpret the word "substantial."

That argument could carry the day, however, in a lawsuit filed earlier this month by Disney and NBCUniversal against Midjourney, another AI tool, which generates images. The lawsuit alleges that Midjourney's output displays "AI-generated images of their copyrighted characters."