Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
I’ll train my AI on just the bee movie. Then I’m going to ask it “can you make me a movie about bees”? When it spits out the whole movie, I can just watch it or sell it or whatever, it was a creation of my AI, which learned just like any human would! Of course I didn’t even pay for the original copy to train my AI, it’s for learning purposes, and learning should be a basic human right!
In the meantime I’ll introduce myself into the servers of large corporations and read their emails, codebase, teams and strategic analysis, it’s just learning!
That would be like you writing out the bee movie yourself after memorizing the whole movie and claiming it is your own idea or using it as proof that humans memorizing a movie is violating copyright. Just because an AI is violating copyright by outputting the whole bee movie, it doesn’t mean training the AI on copyright stuff is violating copyright.
Let’s just punish the AI companies for outputting copyright stuff instead of for training with them. Maybe that way they would actually go out of their way to make their LLM intelligent enough to not spit out copyrighted content.
Or, we can just make it so that any output made by an AI that is trained on copyrighted stuff cannot be copyrighted.
If the solution is making the output non-copyrighted it fixes nothing. You can sell the pirating machine on a subscription. And it’s not like Netflix where the content ends when the subscription ends, you have already downloaded all the not-copyrighted content you wanted, and the internet would be full of non-copyrighted AI output.
Instead of selling the bee movie, you sell a bee movie maker, and a spiderman maker, and a titanic maker.
Sure, file a copyright infringement claim each time you manage to make an AI output copyrighted content. Just run it on a loop and it’s a money making machine. That’s fine by me.
Yeah, because running the AI also has some cost, so you are selling the subscription to run the AI on their server, not its output.
I’m not sure what the legality of selling a bee movie maker is, so you’d have to research that one yourself.
It’s not really a money making machine if you lose more money running the AI on your server farm, but whatever floats your boat. Also, there are already lawsuits based on outputs created from chatgpt, so it is exactly what is already happening.
Yeah, making sandwiches also costs money! I have to pay my sandwich making employees to keep the business profitable! How do they expect me to pay for the cheese?
EDIT: also, you completely missed my point. The money making machine is the AI because the copyright owners could just sue them every time it produces copyright-protected material if we decided to take that route, which is what the parent comment suggested.
They should pay for the cheese, I’m not arguing against that, but they should be paying the same amount as a normal human would if they want access to that cheese. No extra fees for access to copyrighted material if you want to use it to train AI vs wanting to consume it yourself.
And I didn’t miss your point. My point was that the reality is already occurring since people are already suing OpenAI for ChatGPT outputs that the people suing are generating themselves, so it’s no longer just a hypothetical. We’ll see if it is a money making machine for them or whether they’ll just waste their resources doing that.
Media is not exactly like cheese though. With cheese, you buy it and it’s yours. Media, however, is protected by copyright. When you watch a movie, you are given a license to watch the movie.
When an AI watches a movie, it’s not really watching it, it’s doing a different action. If the license of the movie says “you can’t use this license to train AI, use the other (more expensive) license for such purposes”, then AIs have extra fees to access the content that humans don’t have to pay.
Both humans and AI consume the content, even if they do not do so in the exact same way. I don’t see the need to differentiate that. It’s not like we have any idea of the mechanism by which humans consume content to make the differentiation in the first place.
There is actually already a website where people just recreated the bee movie by hand so idk it might actually work as a legal argument.
I don’t think that’s a feasible dream in our current system. They’ll just lobby for it, some senators will say something akin to “art should always have been a hobby, not a profession”, then make adjustments to the current copyright laws so that they can be copyrighted.
learning should be a basic human right!
Education is a basic human right (except maybe in the USA, then it should be one there)
Yeah. A human right.
I am thrilled to see the output you get!
I don’t think LLMs should be taken down, it would be impossible for that to happen. I do, however, think they should be forced into open source.
No but you would definitely design a car based on other designs made before.
Let’s engage in a little fantasy. Someone invents a magic machine that is able to duplicate apartments, condos, houses, … You want to live in New York? You can copy yourself a penthouse overlooking Central Park for just a few cents. It’s magic. You don’t need space. It’s all in a pocket dimension like the Tardis or whatever. Awesome, right? Of course, not everyone would like that. The owner of that penthouse, for one. Their multi-million dollar investment is suddenly almost worthless. They would certainly demand that you must not copy their property without consent. And so would a lot of people. And what about the poor construction workers, ask the owners of construction companies? And who will pay to have any new house built?
So in this fantasy story, the government goes and bans the magic copy machine. Taxes are raised to create a big new police bureau to monitor the country and to make sure that no one uses such a machine without a license.
That’s turned from magical wish fulfillment into a dystopian story. A society that rejects living in a rent-free wonderland but instead chooses to make itself poor. People work to ensure poverty, not to create wealth.
You get that I’m talking about data, information, knowledge. The first magic machine was the printing press. Now we have computers and the Internet.
I’m not talking about a utopian vision here. Facts, scientific theories, mathematical theorems, … All such is free for all. Inventors can get patents, but only for 20 years and only if they publish them. They can keep their invention secret and take their chances. But if they want a government enforced monopoly, they must publish their inventions so that others may learn from it.
In the US, that’s how the Constitution demands it. The copyright clause: [The United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Cutting down on Fair Use makes everyone poorer and only a very few, very rich people richer. Have you ever thought about where the money goes if AI training requires a license?
For example, to Reddit, because Reddit has rights to all those posts. So do Facebook and Xitter. Of course, there’s also old money, like the NYT or Getty. The NYT has the rights to all their old issues going back about a century. If AI training requires a license, they can sell all their old newspapers again. That’s pure profit. Do you think they will give their employees raises out of the pure goodness of their heart if they win their lawsuits? They have no legal or economic reason to do so. The belief that this would happen is trickle-down economics.
I thought the larger point was that they’re using plenty of sources that do not lie in the public domain. Like if I download a textbook to read for a class instead of buying it - I could be prosecuted for stealing. And they’ve downloaded and read millions of books without paying for them.
And they've downloaded and read millions of books without paying for them.
Do you have a source on that?
Most AI models used Books3 as part of their dataset which is a collection of pirated books. Here are a few articles talking about it:
https://www.theverge.com/2024/8/20/24224450/anthropic-copyright-lawsuit-pirated-books-ai
https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/
Thank you
Maybe if you paid for training data they would let you use copyrighted data or something?
Had the company paid for the training data and/or left it as voluntary, there would be less of a problem with it to begin with.
Part of the problem is that they didn’t, but are still using it for commercial purposes.
Their business strategy is built on the assumption that they won’t. They don’t want this door opened at all. It was a great deal for Google to buy Reddit’s data for some $mil., because it is a huge collection behind one entity. Now imagine communicating with each individual site owner whose resources they scraped.
If that could’ve been how it started, the development of these AI tools could be much slower because of (1) data being added to the bunch only after an agreement, (2) more expenses meaning less money for hardware expansion and (3) investors and companies being less hyped up about that thing because it doesn’t grow like a mushroom cloud while following legal procedures. Also, (4) the ability to investigate and collect a public list of which sites they have agreements with is pretty damning, making its own news stories and conflicts.
Not even stealing cheese to run a sandwich shop.
Stealing cheese to melt it all together and run a cheese shop that undercuts the original cheese shops they stole from.
Whatever happened to copying isn’t stealing?
I think the crux of the conversation is whether or not the world is better with ChatGPT. I say yes. We can tackle the disinformation in another effort.
When you copy to consume yourself it’s way different than when you copy to sell the copy for a lower price.
They’re not selling the copy, bruh. They’re selling a technology that very few understand. Smart people pretend they get it, but they don’t. That’s how rare the math is.
So because you don’t understand it, everything it does should be legal?
It’s not rare maths. There are tens of thousands of AI experts. And most CS graduates (millions) have a good understanding of how they work, just not the specifics of the maths.
Yeah, they’re not selling a copy, they are just selling a subscription to a copying machine loaded with the information needed to make a copy. Totally different.
I should start a business of printers and attach a USB with the PNG of a dollar bill. And of course my printers won’t have any government mandated firmware that disables printing fake money.
I’m not printing fake money! It’s my clients! Totally legal.
Though I am not a lawyer by training, I have been involved in such debates personally and professionally for many years. This post is unfortunately misguided. Copyright law makes concessions for education and creativity, including criticism and satire, because we recognize the value of such activities for human development. Debates over the excesses of copyright in the digital age were specifically about humans finding the application of copyright to the internet and all things digital too restrictive for their educational, creative, and yes, also their entertainment needs. So any anti-copyright arguments back then were in the spirit of specifically protecting the average person and public-serving non-profit institutions, such as digital archives and libraries, from big copyright owners who would sue and lobby for total control over every file in their catalogue, sometimes in the process severely limiting human potential.
AI’s ingesting of text and other formats is “learning” in name only, a term borrowed by computer scientists to describe a purely computational process. It does not hold the same value socially or morally as the learning that humans require to function and progress individually and socially.
AI is not a person (unless we get definitive proof of a conscious AI, or are willing to grant every implementation of a statistical model personhood). Also, AI is not vital to human development and as such, one could argue, does not need special protections or special treatment to flourish. AI is a product, even more clearly so when it is proprietary and sold as a service.
Unlike past debates over copyright, this is not about protecting the little guy or organizations with a social mission from big corporate interests. It is the opposite. It is about big corporate interests turning human knowledge and creativity into a product they can then use to sell services to - and often to replace in their jobs - the very humans whose content they have ingested.
See, the tables are now turned and it is time to realize that copyright law, for all its faults, has never been only or primarily about protecting large copyright holders. It is also about protecting your average Joe from unauthorized uses of their work. More specifically uses that may cause damage, to the copyright owner or society at large. While a very imperfect mechanism, it is there for a reason, and its application need not be the end of AI. There’s a mechanism for individual copyright owners to grant rights to specific uses: it’s called licensing and should be mandatory in my view for the development of proprietary LLMs at least.
TL;DR: AI is not human, it is a product, one that may augment some tasks productively, but is also often aimed at replacing humans in their jobs - this makes all the difference in how we should balance rights and protections by law.
AI are people, my friend. /s
But, really, I think people should be able to run algorithms on whatever data they want. It’s whether the output is sufficiently different or “transformative” that matters (and other laws like using people’s likeness). Otherwise, I think the laws will get complex and nonsensical once you start adding special cases for “AI.” And I’d bet if new laws are written, they’d be written by lobbyists to further erode the threat of competition (from free software, for instance).
What do you think “ingesting” means if not learning?
Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue. AI is simply predicting the next token/word based on statistics.
You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV - not illegal if you’re a human, should it be illegal if you’re a machine?
There may be moral issues on training on someone’s hard gathered knowledge, but there is no legislation against it. Reading books and using that knowledge to provide information is legal. If you try to outlaw automating this process by computers, there will be side effects, such as search engines no longer being able to index data.
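To make “predicting the next token/word based on statistics” concrete, here is a toy sketch. It is only a word-pair (bigram) counter, not how an actual LLM works: real models are neural networks trained on vastly more text. The function names and the example corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text, invented for this example.
corpus = "the bee flies to the flower and the bee drinks nectar"
model = train_bigram(corpus)

print(predict_next(model, "the"))     # most common word after "the"
print(predict_next(model, "nectar"))  # nothing ever followed "nectar"
```

After training, the only thing kept is the table of counts; the prediction is just a lookup of the most frequent follower.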
Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue.
Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model’s weights afterwards.
You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV - not illegal if you’re a human, should it be illegal if you’re a machine?
Yes, because copyright law is intended to benefit human creativity.
If you try to outlaw automating this process by computers, there will be side effects, such as search engines no longer being able to index data.
Wrong. Search engines retain a minimal amount of the indexed website’s data, and the purpose of the search engine is to generate traffic to the website, providing benefit for both the engine and the website (increased visibility, the opportunity to show ads to make money). Banning the use of copyrighted content for AI training (which uses the entire copyrighted work and whose purpose is to replace the organizations whose work is being used) will have no effect.
What do you mean that search engines retain a minimal amount of a site’s data? Obviously it needs to index all contents to make it searchable. If you search for keywords within an article, you can find the article, therefore all of it needs to be indexed.
Indexing is nothing more than “presenting data to the algorithm” so it’d be against the law to index a site under your proposed legislation.
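As a rough illustration of what “indexing” means here, below is a minimal inverted-index sketch. It assumes nothing about how any real search engine is built (those may also store snippets, caches, and ranking data); the URLs, page texts, and function names are hypothetical.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Every word of every page is read while building the index.
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, invented for this example.
pages = {
    "https://example.com/a": "copyright protects expression not ideas",
    "https://example.com/b": "search engines index ideas and pages",
}
index = build_index(pages)

print(search(index, "ideas"))            # pages mentioning "ideas"
print(search(index, "copyright ideas"))  # pages mentioning both words
```

The whole text is processed to build the index, but what this sketch keeps afterwards is a word-to-URL mapping rather than the pages themselves.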
Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model’s weights afterwards.
This is an interesting take, I’d be inclined to agree, but you’re still facing the problem of how to distinguish training AI from indexing for search purposes. I’m afraid you can’t have it both ways.
Copyright law protects the ability of the copyright holder to make money. The laws were created before AI and now obviously have to be adapted to new technology (like you didn’t really need copyright before the invention of printing). How exactly AI will be regulated is in the end up to society to decide, which most likely will come down to who has the better lobby.
Am I the only person that remembers that it was “you wouldn’t steal a car”, or has everyone just decided to pretend it was “you wouldn’t download a car” because that’s easier to dunk on?
You wouldn’t shoot a policeman and then steal his helmet.
These anti piracy commercials have gotten really mean.
People remember the parody, which is usually modified to be more recognizable. Like Darth Vader never said “Luke, I am your father”; in the movie it’s actually “No, I am your father”.
Maybe add a spoiler alert next time. Jeez.
Spoiler alert, but Rosebud was his sled all along.
I’m pretty sure it’s either Mandela Effect or a massive gaslighting conspiracy. Though I guess that’s true for everything that’s collectively misremembered.
Then OpenAI should pay for a copy, like we do.
Is there an official statement on whether OpenAI pays for at least one copy of whatever they throw into the bots?
Counteroffer. We eliminate copyright laws altogether. For anyone and everyone.
Let’s move to a system in which we fund projects before their release. And once released, they are available to everyone for free.
Also let’s make a system where everyone can work a basic job for 20-30 hours a week and get a living wage, and the rest of the time we can just produce art or any kind of thing for free for anyone, since we’ll already have our needs covered.
And free cotton candy and rainbows for everybody!
Even if you come to the conclusion that these models should be allowed to “learn” from copyrighted material, the issue is that they can and will reproduce copyrighted material.
They might not recreate a picture of Mickey Mouse that exists already, but they will draw a picture of Mickey Mouse. Just like I could, except I’m aware that I can’t monetize it in any way. Well, new Mickey Mouse.
This is an issue for the AI user though. And I do agree that needs to be more conscious in people’s minds. But I think time will change that. Perhaps when the photo camera came out there were some schmucks that took pictures of people’s artworks and claimed them as their own because the novelty of the technology allowed that for a bit, but eventually those people are properly differentiated from people properly using it.
Okay that’s just stupid. I’m really fond of AI but that’s just common Greed.
“Free the Serfs?! We can’t survive without their labor!!” “Stop Child labour?! We can’t survive without them!” “40 Hour Work Week?! We can’t survive without their 16 Hour work Days!”
If you can’t make profit yet, then fucking stop.
I hate to say this, but “let the market decide”: if AI is something the consumer wants/needs, they’ll pay for it; otherwise, let it die.