Last month brought news of a copyright dispute that could signal a seismic shift in the dynamics between the generative AI space and the music industry.
Universal Music Publishing Group sued multi-billion-dollar-backed AI company Anthropic for the alleged “systematic and widespread infringement of their copyrighted song lyrics” via its chatbot Claude.
The suit, filed by UMPG along with co-plaintiffs Concord Music Group and ABKCO, claims that “in the process of building and operating AI models, Anthropic unlawfully copies and disseminates vast amounts of copyrighted works — including the lyrics to myriad musical compositions owned or controlled by Publishers”.
UMPG et al’s lawsuit seeks potentially tens of millions of dollars in damages from Anthropic, but perhaps more significant is that the outcome of the case could set a major legal precedent for AI companies’ use of copyrighted lyrics on their platforms.
We won’t know that outcome for some time yet, but details published within a filing from Anthropic with the US Copyright Office last week could be an early indicator of the stance the AI firm is planning to take in its copyright battle with the publishers.
Back in August, the United States Copyright Office (USCO) issued a notice of inquiry (NOI) in the Federal Register on the topic of copyright and AI, and alongside it announced a study of the copyright law and policy issues raised by artificial intelligence systems.
In order to inform the study and “help assess whether legislative or regulatory steps in this area are warranted”, the USCO asked for written comment on these issues, “including those involved in the use of copyrighted works to train AI models, the appropriate levels of transparency and disclosure with respect to the use of copyrighted works, and the legal status of AI-generated outputs”.
The companies that submitted written responses as part of the study include tech giants like Meta, Google and Adobe, as well as prominent AI firms like Stability AI and Anthropic.
The Verge has published a roundup of some of the key arguments put forward by these companies regarding the relationship between copyrighted content and the datasets used to train generative AI.
According to UMPG et al’s lawsuit last month, which you can read in full here, Anthropic infringes the music companies’ copyrights by “scraping and ingesting massive amounts of text from the internet and potentially other sources, and then using that vast corpus to train its AI models and generate output based on this copied text”.
Anthropic explains in its recent USCO filing, which you can read here (and which we must stress is not connected to last month’s lawsuit), that its Claude chatbot “is trained using data from publicly available information on the Internet as of December 2022, non-public datasets that we commercially obtain from third parties, data that our users or companies hired to provide data labeling and creation services voluntarily create and provide, and data we generate internally”.
The company also claims that it “operates its crawling system transparently,” which, it says, “means website operators can easily identify Anthropic visits and signal their preferences to Anthropic”.
Furthermore, Anthropic says that Claude is trained using “Constitutional AI” which, it explains, means that its “model chooses the best output based on a clearly defined, explicit set of values-based instructions” rather than on direction from the user.
It adds: “We have worked to incorporate respect for copyright into the design of Claude in a foundational way. We don’t believe users should be able to create outputs using Claude that infringe copyrighted works. That is not an intended or permitted use of this technology, and we take steps to prevent it.”
Here are some of Anthropic’s arguments about the relationship between generative AI and copyright law:
1. Anthropic argues that training LLMs using copyrighted material is ‘fair use’
Anthropic tells the USCO that “the way Claude was trained qualifies as a quintessentially lawful use of materials”.
Citing the US Copyright Act, the company argues that “copyright protects particular expressions, but does not extend ‘to any idea, procedure, process, system, method of operation, concept, principle, or discovery’.”
“The way Claude was trained qualifies as a quintessentially lawful use of materials.”
Anthropic adds: “For Claude, as discussed above, the training process makes copies of information for the purposes of performing a statistical analysis of the data.
“The copying is merely an intermediate step, extracting unprotectable elements about the entire corpus of works, in order to create new outputs. In this way, the use of the original copyrighted work is non-expressive; that is, it is not re-using the copyrighted expression to communicate it to users.
“To the extent copyrighted works are used in training data, it is for analysis (of statistical relationships between words and concepts) that is unrelated to any expressive purpose of the work.
“This sort of transformative use has been recognized as lawful in the past and should continue to be considered lawful in this case.”
Anthropic also cites various cases, which you can see on page 7 of its USCO filing here, that, it argues, “have allowed copying works in order to create tools for searching across those works and to perform statistical analysis”.
The filing adds: “The training process for Claude fits neatly within these same paradigms and is fair use. Training uses works in a highly transformative, non-expressive way; rather than replicating and expressing the pre-existing work itself.
“As discussed above, Claude is intended to help users produce new, distinct works and thus serves a different purpose from the pre-existing work.”
2. Anthropic does not believe that “direct, collective, or compulsory” licensing “is necessary per se” when it comes to training large language models.
One of the questions Anthropic submitted a written answer to was: “Is direct, collective, or compulsory licensing of copyrighted material practicable/economically feasible for training LLMs?”
Anthropic argues that “because training LLMs is a fair use, [it does] not believe that licensing is necessary per se”.
“Because training LLMs is a fair use, we do not believe that licensing is necessary per se.”
The company adds: “To be sure, for a variety of reasons, developers may choose to procure special access to or use of particular datasets as part of commercial transactions.
“However, a regime that always requires licensing for use of material in training would be inappropriate; it would, at a minimum, effectively lock up access to the vast majority of works, since most works are not actively managed and licensed in any way.”
Anthropic claims further that “as a public benefit corporation,” it is “open to engaging in further discussion of appropriate permission regimes”, but says that “policymakers should be aware of the significant practical challenges that a collective licensing regime would entail”.
Anthropic adds: “Licensing training data still raises many questions and potential problems from both policy and practical perspectives given that models can be trained on substantial volumes of works.
“Requiring a license for non-expressive use of copyrighted works to train LLMs effectively means impeding use of ideas, facts, and other non-copyrightable material.”
3. Anthropic suggests that users could be liable for generative AI outputs that infringe copyrights
Anthropic’s response to this question in the USCO’s study might form a part of its defense in its legal dispute with UMPG.
Question 25 asks: “Who should be liable for generative AI outputs that may infringe copyrights?”
According to Anthropic: “Generally, responsibility for a particular output will rest with the person who entered the prompt to generate it. That is, it is the user who engages in the relevant ‘volitional conduct’ to generate the output and thus will usually be the relevant actor for purposes of assessing direct infringement.”
“Generally, responsibility for a particular output will rest with the person who entered the prompt to generate it.”
Anthropic adds: “At the same time, courts also have tools to adjudicate whether a service provider (or others involved in development of an LLM) face secondary liability for the user’s conduct.
“While merely offering an LLM service (including doing so commercially) would not in and of itself generate liability, courts are well-equipped to examine particular circumstances where a service provider meets the relevant thresholds for secondary liability – i.e., whether the provider knows and materially contributes to the infringement; has the right and ability to control the act and directly financially benefits; or induces the infringement by clearly promoting use of its tool for infringing purposes.”
Anthropic explains further: “Claude employs a range of measures to inhibit the production of infringing outputs, including terminating accounts of repeat infringers or violators if we become aware of their infringing activities.
“We look forward to continued collaboration with content creators and others to ensure these measures to combat such uses are robust.”
If Anthropic does choose to use this user liability argument in the suit filed by UMPG, it may only get it so far.
That’s because one of the issues alleged in UMPG, Concord and ABKCO’s complaint is that Anthropic’s AI models generate output containing the publishing companies’ lyrics “even when the models are not specifically asked to do so”.
The lawsuit claims that the Claude chatbot responds to various prompts that don’t specifically ask for the copyrighted lyrics “by generating output that nevertheless copies Publishers’ lyrics”.
Examples of such requests include asking the chatbot to “write a song about a certain topic, provide chord progressions for a given musical composition, or write poetry or short fiction in the style of a certain artist or songwriter”.
Music Business Worldwide