Four music datasets holding millions of tracks are being shared among AI developers, The Atlantic reports

Photo credit: Phonlamai Photo/Shutterstock

Four datasets of music are circulating among artificial intelligence developers, and together they hold more than 21 million recordings, according to a report by The Atlantic.

The collections were identified by The Atlantic’s Alex Reisner.

They are filled with copyrighted music, spanning household names and tens of thousands of lesser-known independent artists, according to the report.

Two of the datasets each contain more than 100,000 recordings, according to The Atlantic, while the other two are far larger, at roughly 9 million and 12 million tracks.

They include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, and the Beatles, the report said, alongside jazz artists and classical composers.

Each of the four has been downloaded several thousand times, according to The Atlantic, though because the industry keeps its training data under wraps, it isn’t publicly known which companies have used most of them.

The Atlantic reported that Google and Stability AI have used tracks from one of the smaller collections, the Free Music Archive.

Two of the four datasets have publicly documented origins, and neither was created for the purpose of training commercial music generators.

The largest is LAION-DISCO-12M, a collection of more than 12 million tracks released in November 2024 by LAION, a German non-profit that compiles open datasets for AI research.

LAION, which is also behind the dataset used to train Stability AI’s Stable Diffusion image generator, says the music collection was “released for research purposes” and is intended for use “in academic settings.”

LAION also explicitly warns against deploying its datasets commercially, or using them in their original form to create finished products.

The collection contains only links to publicly available YouTube tracks and their metadata, not the audio itself, and LAION says it does not distribute the music.

One of the roughly 100,000-track datasets is the Free Music Archive, published by academic researchers in 2017 as a resource for music-information-retrieval research – the field that studies how software searches, sorts, and analyzes music.

It draws on a library of the same name directed by WFMU, a freeform radio station in the US, whose catalog consists of tracks that artists released under permissive licenses for sharing.

Those Creative Commons licenses governed the free distribution of the music, and were in place long before the generative-AI tools now trained on such material.

AI music companies including Suno and Udio are now grappling with at least 12 lawsuits, according to The Atlantic.

The litigation from major music companies began in June 2024, when the RIAA, acting on behalf of Universal Music Group, Sony Music Entertainment, and Warner Music Group, sued both companies for what it called “mass infringement” of copyright.

The suits alleged that Suno and Udio had copied recordings to train their models without permission.

Since then, UMG and Warner have moved from litigation toward licensing, while Sony has remained in court against both companies.

Universal Music Group settled with Udio in October 2025, announcing a “compensatory legal settlement” plus new recorded-music and publishing licenses for a jointly developed AI platform set to launch in 2026.

Under that deal, Udio’s service is being moved into what UMG called a “walled garden,” with fingerprinting and filtering applied before the new platform launches.

Warner Music Group reached its own settlement and licensing deal with Udio in November 2025, and days later became the first major to settle with Suno.

The Warner-Suno agreement, which the companies called a “first-of-its-kind partnership,” also saw Suno acquire the concert-discovery platform Songkick from WMG.

As part of that deal, Suno said it would launch “new, more advanced and licensed models” in 2026 and deprecate its current models, with downloads on its free tier replaced by playback and sharing.

The financial terms of the Suno and Udio settlements have not been disclosed.

In fact, Suno is fighting in court to keep the terms of its Warner settlement away from UMG and Sony, which remain active plaintiffs against the company.

Udio has since signed further licensing deals – with the independent-label body Merlin in January 2026 and with Kobalt in April 2026.

Sony Music is the one major still litigating against both Suno and Udio, while Germany’s GEMA and Denmark’s Koda are also suing Suno.

Independent artists have brought their own cases, and the American Federation of Musicians has sued UMG and Warner, alleging their members’ recordings were licensed to Suno and Udio without compensation or credit.

How much rightsholders are losing is hard to quantify, but a study commissioned by CISAC, the global body for authors’ societies, has tried to put a number on it.

The research, carried out by PMP Strategy, estimated that generative AI could take 24% of music creators’ revenues by 2028.

That equates to a cumulative loss of €10 billion ($10.5 billion) for creators between 2023 and 2028, with annual losses reaching €4 billion by the end of that period – a figure that excludes record companies and publishers.

PMP Strategy attributed the growth of generative AI to “the use of copyrighted works for the training of their models.”

At the level of an individual act, the instrumental duo The American Dollar alleged in a May 2026 lawsuit that Suno had cut its licensing revenue by nearly 80%.

Their licensing revenue “has been nearly eliminated since the first version of Suno AI was made available to the public,” the complaint stated.

The flood is most visible on streaming services such as Deezer, where AI-generated tracks are arriving in volume.

Deezer said in April 2026 that it was receiving close to 75,000 fully AI-generated tracks a day – more than 44% of all new music uploaded to the platform.

That was up from 60,000 a day in January 2026 and just 10,000 when Deezer launched its detection tool in January 2025.

Consumption of that music remains low, at 1-3% of total streams, with 85% of those streams flagged as fraudulent and demonetized, according to Deezer.

“AI-generated music is now far from a marginal phenomenon,” said Alexis Lanternier, CEO of Deezer, which says it was the first streaming platform to detect and tag synthetic tracks at the platform level.

Qobuz, Apple Music, and Spotify have since introduced their own AI-tagging or disclosure measures.

Music Business Worldwide
