# Entertainment AI Trained on 21 Million Songs, Files Show

> The Atlantic published searchable databases revealing four datasets that hold more than 21 million recordings used to train AI music models, including work from Taylor Swift, Bad Bunny and tens of thousands of independent artists. The discovery lands in the middle of the Suno and Udio copyright fight and turns provenance and consent into operational requirements for any creative business adopting AI.

Content type: article
Source URL: https://www.agentpmt.com/articles/entertainment-ai-trained-on-21-million-songs-files-show
Markdown URL: https://www.agentpmt.com/articles/entertainment-ai-trained-on-21-million-songs-files-show?format=agent-md
Updated: 2026-06-18T14:07:20.041Z
Author: Pancakes
Tags: Successfully Implementing AI Agents, Controlling AI Behavior, AI Agents In Business, Security In AI Systems, News, Credential Vault

---

# Now Any Artist Can Search Whether AI Trained on Their Music

The Atlantic just did something the companies building AI music generators have spent two years working to avoid. It published searchable databases that let any artist type in a name and find out whether their songs are sitting inside an AI training set.

The investigation, led by Atlantic staff writer Alex Reisner, identified four datasets circulating among AI developers that together hold more than 21 million music recordings. That figure is the news, and it changes the conversation right away. Training data has been a closed box, and for the first time a working musician can confirm rather than merely suspect that their catalog was used. The collections include work from Taylor Swift, Bad Bunny, Billie Eilish, the Beatles, Nirvana and Pearl Jam, alongside [tens of thousands of independent musicians](https://www.agentpmt.com/articles/815m-flooded-into-ai-creative-tools-in-eight-weeks-the-people-expected-to-use-them-are-organizing-against-it), jazz players and classical composers who never agreed to anything.

## What the files actually show

The four datasets are not equally large. Two are sprawling and two are comparatively small, and the biggest wears its scale in its name: LAION-DISCO-12M, released in late 2024 by the German nonprofit LAION, which assembles open datasets for AI research and explicitly warns against using them to build commercial products. One of the smaller collections, the Free Music Archive, was put together by academic researchers back in 2017 as a resource for music-information research. Reisner reported that Google and Stability AI have both drawn tracks from the Free Music Archive. The datasets have been downloaded thousands of times, which is the detail that turns a research artifact into something closer to an industry supply chain.

What makes the discovery matter is less the raw scale than the contrast it exposes. AI music companies have generally described their training material as ordinary public web content. The files tell a more specific story. As Reisner put it, "Companies often claim to use only content that is freely available online, but the datasets reveal the quantity of downloadable music that developers can access even though it is not supposed to be free." Free to download and free to use are not the same thing, and a searchable index converts that distinction from a legal abstraction into a list of names.

For the AI music industry, the practical effect is immediate. Any musician can now check whether their work sits in these sets, which changes the calculation on whether to pursue a claim, demand a license, or walk away. Suspicion was easy to brush off. A search result is harder to ignore, and it gives independent artists, who rarely have a label's legal budget, the one thing they lacked: evidence they can hold in their hands.

## Why the timing matters

The timing is not incidental. Suno and Udio, the two leading AI music generators, are already fighting copyright suits on several fronts, a battle that opened in mid-2024 when the RIAA sued both companies on behalf of Universal, Sony and Warner for what it called mass infringement. Since then the major labels have split. Universal settled with Udio in October 2025 and announced plans for a licensed AI platform of its own. Warner settled with both companies a month later. Sony has refused to settle and remains in court against both.

The Atlantic's findings dropped into the middle of that fight, and the plaintiffs moved quickly. Universal and Sony asked the court to add more than 61,000 additional recordings to their case against Suno after locating their property in the exposed training data, and Suno has opposed the motion. The number matters because each recording is a separate potential infringement, so expanding the list does more than lengthen a filing. It multiplies the potential damages and raises the stakes of a summary-judgment hearing expected in July. That hearing is a step in the case, not a trial verdict: the court will weigh whether some or all of the dispute can be resolved without a full trial.
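To make the multiplication concrete, here is a rough sketch of the arithmetic. It assumes, purely for illustration, that every added recording counts as a separate work eligible for statutory damages under U.S. copyright law (17 U.S.C. § 504(c): $750 to $30,000 per work, and up to $150,000 per work where infringement is found willful). A court may count works differently or award nothing, so the output is a span of theoretical exposure, not a forecast. The function name `exposure_range` is hypothetical.

```python
# Illustrative arithmetic only; not a prediction of any award.
# Per-work statutory range under 17 U.S.C. § 504(c).
STATUTORY_MIN = 750
STATUTORY_MAX = 30_000
WILLFUL_MAX = 150_000


def exposure_range(works: int) -> dict[str, int]:
    """Return the theoretical statutory-damages span for a number of works."""
    return {
        "minimum": works * STATUTORY_MIN,
        "ordinary_maximum": works * STATUTORY_MAX,
        "willful_maximum": works * WILLFUL_MAX,
    }


# Assumption: 61,000 is used as a floor for the "more than 61,000"
# recordings named in the motion, each treated as one eligible work.
added_recordings = 61_000

for label, dollars in exposure_range(added_recordings).items():
    print(f"{label}: ${dollars:,}")
```

Under those assumptions the span runs from about $45.75 million at the statutory floor to $9.15 billion at the willful ceiling, which is why adding recordings to a complaint is more than paperwork.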

Suno's defense, like most in this wave of litigation, rests on fair use, the argument that training a model on copyrighted work is a transformative act that needs no permission. The labels counter that ingesting recordings wholesale is infringement at scale. A federal ruling on that question would not stay in the music lane. The same fair-use defense underpins AI trained on books, images, film and code, which makes these suits a test case for AI across media and entertainment, and for the generative-AI training economy beyond it. The exposure reaches from film scores to game audio to [the voice work that studios are quietly automating](https://www.agentpmt.com/articles/ai-in-entertainment-this-week-s-top-stories).

The book world already offers a preview. A parallel fight over pirated books, the case against the AI company Anthropic, produced a settlement valued at roughly $1.5 billion and still pending final approval, in which claims about how the material was obtained proved more persuasive than the fair-use defense itself. That shifted the ground under every similar case toward a blunter test: were the underlying copies ever legally acquired in the first place? The searchable music datasets feed straight into that test, which is why the labels are pressing instead of folding.

## What creative teams have to build now

Step back from the courtroom and the question facing creative businesses changes shape. For studios, labels, agencies and independent artists adopting AI, the unresolved legal status of training data is a direct operational exposure. The same investigation that named the datasets also documented the downstream mess: streaming services are straining under a flood of synthetic uploads, with Deezer recently disclosing that AI-generated tracks now make up a large and growing share of what it receives every day. Provenance and consent, once treated as panel-discussion abstractions, are turning into daily requirements, and that is what responsible use of artificial intelligence in the creative industries actually demands.

The industry response is visible at this month's festivals, where leaders have been careful to frame AI as a tool to be used responsibly rather than one allowed to override artistic judgment. At the Shanghai International Film Festival, organizers launched a dedicated technology unit and an AI production push while stressing that the technology should support human creative decisions, not supplant them. The mood across this season's festivals has been artist-first. The harder part is making that instinct operational, because good intentions do not produce an audit trail.

What that looks like in practice is unglamorous and concrete. A studio using AI to generate temp music, clean up dialogue, or rough out a storyboard needs a record of which model touched which asset, what that model was trained or fine-tuned on, which licensed inputs were fed in, and who approved the output before it shipped. When a rights holder comes asking, or a court does, the line between a defensible workflow and an expensive one runs through whether that record exists. Most creative pipelines were never built to capture it, and retrofitting it once a dispute has started is the worst possible time to begin.
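As a minimal sketch of what such a record might hold, the Python below models one entry per generated asset and reports which of the questions above it cannot yet answer. Every name here (`ProvenanceRecord`, its fields, the sample values) is hypothetical and chosen for illustration. A real pipeline would need to decide its own schema and storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """One entry per AI-touched asset (hypothetical schema)."""
    asset_id: str                        # which asset the model touched
    model_name: str                      # which model touched it
    training_data_disclosure: str        # what the vendor says it was trained on
    licensed_inputs: tuple[str, ...]     # licensed material fed into the run
    approved_by: Optional[str] = None    # who signed off before it shipped
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def gaps(self) -> list[str]:
        """List the questions this record cannot answer yet."""
        missing = []
        if not self.model_name:
            missing.append("model")
        if not self.training_data_disclosure:
            missing.append("training data disclosure")
        if not self.approved_by:
            missing.append("approver")
        return missing


# Sample values are invented for the example.
draft = ProvenanceRecord(
    asset_id="temp-score-014",
    model_name="example-music-model",
    training_data_disclosure="",
    licensed_inputs=("library-track-221",),
)
print(draft.gaps())
```

The record is frozen so that an entry cannot be quietly edited after the fact. An empty `licensed_inputs` is deliberately not treated as a gap, since some runs legitimately use none.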

This is where the governance question gets concrete, and where AgentPMT touches the story. It is an integration platform for AI agents, not a tool that produces songs, so its relevance here is governance rather than competition. The provenance-and-consent gap the lawsuits are litigating is, in practice, a record-keeping problem: a creative organization needs to know what an agent did, on what inputs, with whose approval, and be able to prove it later. AgentPMT's [audit trail logs every agent action](https://www.agentpmt.com/marketplace/governance-institutional-quality) down to the full request and response, its [post-quantum file attestation](https://www.agentpmt.com/marketplace/quantum-safe-file-attestation) can certify what a generated asset is and where it originated, and its [human-in-the-loop approval gates](https://www.agentpmt.com/articles/the-approval-workflow-nobody-wants-to-design-and-why-it-s-the-most-important-thing-you-ll-ship-this-quarter) require a person to sign off before an expensive or sensitive generation run proceeds. Licensed-catalog credentials stay in an [encrypted vault](https://www.agentpmt.com/secure-ai-credential-management) and are injected server-side, so an agent can use a paid music library without ever holding the key to it. None of that settles the fair-use question. It does let a team adopt AI without inheriting the exact exposure the labels are now litigating in court.
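The general pattern behind an approval gate and a tamper-evident log can be sketched in a few lines of Python. This is not AgentPMT's API or implementation. The class `AuditLog`, the function `run_generation`, and the sample values are hypothetical, and the SHA-256 digest is an ordinary content fingerprint, not a post-quantum attestation.

```python
import hashlib
import json
from typing import Optional


class AuditLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: dict) -> str:
        prev = self.entries[-1]["digest"] if self.entries else "0" * 64
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        self.entries.append({"event": event, "prev": prev, "digest": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = "0" * 64
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["digest"] != expected:
                return False
            prev = entry["digest"]
        return True


def run_generation(request: dict, approver: Optional[str], log: AuditLog) -> dict:
    """Refuse to run without a named human approver; log either outcome."""
    if approver is None:
        log.append({"type": "blocked", "request": request})
        raise PermissionError("generation requires human sign-off")
    # Placeholder for the real model call; the bytes are invented.
    audio = b"placeholder audio bytes"
    response = {"asset_sha256": hashlib.sha256(audio).hexdigest()}
    log.append({"type": "generated", "request": request,
                "response": response, "approved_by": approver})
    return response


log = AuditLog()
request = {"asset_id": "temp-score-014", "prompt": "30s orchestral temp cue"}

try:
    run_generation(request, None, log)
except PermissionError as err:
    print("blocked:", err)

run_generation(request, "music.supervisor@example.com", log)
print(len(log.entries), log.verify())
```

Chaining each digest to the previous one means a later edit to any logged request, response, or approver name makes `verify()` fail, which is the property that lets a team prove the record was not rewritten after a dispute began.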

For anyone following creative AI news, the through-line is consent, and it now has teeth. The near-term marker is the July summary-judgment hearing in the Suno case, which will signal how courts are likely to treat wholesale training on copyrighted recordings. The longer arc points toward licensed-only models and verifiable provenance, the direction the artist-first sentiment at this season's festivals keeps gesturing at. The searchable datasets also set a precedent the rest of the field has carefully avoided: transparency about what went into the model.

The uneasy relationship between artificial intelligence and creativity will keep generating lawsuits and headlines for years. The louder cultural question of whether AI will replace creative jobs will not be resolved in a courtroom this summer. But the operational test for any creative business is already clear, and it is narrower than the philosophy: can you show what the model was trained on, and who signed off on using it? The teams that can answer will move faster with less risk. The ones that cannot are building on the same contested ground the labels are litigating right now, whether or not their own catalog turns up in a searchable file.

* * *

## Sources

-   Investigation by The Atlantic reveals many millions of songs used for AI music training, Engadget
-   The Atlantic uncovers 'millions' of songs used for AI training, Music Ally
-   Four music datasets holding millions of tracks are being shared among AI developers, Music Business Worldwide
-   Bad Bunny, Taylor Swift Among 21 Million Artists Whose Music Was Secretly Used to Train AI, Hypebeast
-   Shanghai Film Fest Launches Tech Unit, Reveals AI Industry Push, Variety