The two sides to the copyright problem — a closer look.
The question of whether generative AI infringes copyright has two answers:
- No: Model training is a lossy process (the original details are not preserved). An LLM is also a transformative, derived work, and its output therefore falls under fair use.
- The way you use or distribute that output is a different question: if you knowingly and substantially infringe on someone’s copyright and stand to profit from it, for instance by lifting material from other books to write your own, or by cloning copyrighted code and then distributing the result, that should count as a violation.
- Yes: All training data ultimately contributes value to the final output. People spent their time creating the art, literature, and code; if they hadn’t, the model couldn’t exist in the first place. The original authors should always be compensated and should be able to decide whether their work can be used in a dataset.
- Many generative AI users cannot even tell whether they are infringing on copyright, especially since there is no practical way to run a generated image, say, through a check for resemblance to a copyrighted one (see the rough sketch after this list). When the model never points you to its sources, the end user could be plagiarizing someone else’s work without knowing it.
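To make that last point concrete, here is a minimal, purely illustrative sketch of what such a resemblance check might look like, assuming the third-party Pillow and imagehash Python packages and hypothetical file names. It also shows the limitation: a perceptual hash only flags near-duplicates, not stylistic borrowing, so it cannot settle infringement questions on its own.

```python
# Toy resemblance check using a perceptual hash (assumes the third-party
# "Pillow" and "imagehash" packages; file names are hypothetical).
from PIL import Image
import imagehash

generated = imagehash.phash(Image.open("generated_output.png"))
reference = imagehash.phash(Image.open("copyrighted_reference.png"))

# Subtracting two ImageHash objects gives the Hamming distance between them;
# a small distance means the images are structurally near-identical.
distance = generated - reference
print(f"Perceptual hash distance: {distance}")

if distance <= 8:  # arbitrary threshold for "suspiciously similar"
    print("Near-duplicate detected; worth a manual review.")
else:
    print("No near-duplicate found, which still says nothing about style or composition.")
```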
What we need to understand, first of all, is that copyright is all about distribution (and reproduction and exhibition), not creation. These models create; they do not distribute. For example, you could read all of Tolkien’s work and write a story that uses the same characters, and it would be fine as long as you didn’t get it printed for sale. Similarly, if an AI tool gives you a story using Middle-earth characters, it’s not copyright infringement until you decide to profit from it.
Models could themselves be called derived works if they are trained exclusively on copyrighted material. In that case, all output is also a derived work, even if the model is spitting out something entirely different. This line of reasoning is hard to sustain, though, because training datasets are not isolated: they often mix vast amounts of open-source material with copyrighted material.
Is it just “being inspired”?
Technically, one could argue that people have always learned from others. In art, literature, music, and even software development, people have drawn heavily on others’ work. Whether the inspiration is conscious or the influence is subconscious, for our purposes we’ll treat both the same.
It’s true that an organization’s ability to profit off someone else’s work should be limited. But if we follow this reasoning all the way, it also means the public should be given free rein to use the trained model’s output however it likes. After all, the public is also just learning and being inspired, isn’t it?
However, that raises a bigger question. Learning from others is how we humans have always done it, but the LLMs behind generative AI tools are not us. They are non-human.
We learn from others, are influenced by their work, whether books, art, cinema, code, music, or games, and are free to create something based on that. An LLM is a vastly superior construct: it learns at an absurd speed and can gulp down an entire library section’s worth of data in a matter of seconds to produce better and more accurate work, all for profit.
As such, shouldn’t LLMs be constrained rather than afforded the same liberties that slower humans enjoy?
In particular, the fact that it’s mostly private corporations like Google, Meta, Microsoft, and OpenAI doing this training, not solo researchers, scientists, and programmers, further accentuates the problem. All of an LLM’s training, whether the model comes from Google or from a startup founded yesterday, is done to make a profit for a company.
Speed is the heart of the matter here.
Hypothetically, a musician could optimize their process: listen to other artists, take notes faster, research faster, and then write their own songs inspired by all of it (which is fair and does not infringe copyright).
That is still nowhere near the speed at which an LLM is trained. There is no exhaustion and no upper cap, and LLMs are always built by corporations.
Though it’s true that the power of the mind and its neurons is comparable in some ways to an LLM, it’s safe to say that the majority of humans don’t, can’t, or won’t undertake a rigorous self-training program to absorb a musician’s entire catalog, memorize a whole genre of literature, or somehow store thousands of Wikipedia articles in their head to produce better output.
The speed argument, therefore, stands.
AI isn’t an author, but the work of an author
After being developed, no AI model has gone out looking for new sources to learn more about a topic. It is always a human that feeds it the datasets. In other words, AI models are not reading books; humans are choosing the books and feeding them to the models, without reading them all themselves.
That effectively makes a generative AI tool, which uses this model to give end users access to its output or to sell it to them, a work by an author (the human) rather than the author of a work.
But neither the AI model nor the human developer is actually being inspired or influenced by the books in the dataset. Neither is “reading” anything. The developer feeds the pages, so to speak, to the model, and the model runs algorithms on them. When the algorithms are done running, the developer makes money off books written by someone else, without paying them.
Somehow, we keep coming back to “profit” and “corporations” as the root of the evil. Is that so?
What if there were no profit?
If individual creators were doing the heavy lifting, or if all LLMs were open source with no company better positioned to profit than another, the arguments would be much less harsh.
When a company scrapes files off the internet and the literary world, trains a model, and essentially sells its generative capabilities to others, it looks evil.
But humans have done it historically.
Andrzej Sapkowski wrote The Witcher series of books. Like many before and after him, he took elements from folklore and myths originally created or told by others. His novels make money without compensating anyone, and he has sold rights to his lore to the game studio CD Projekt Red and to Netflix for further adaptations.
Where do we draw the line then? When does it stop being an infringement?
A more concrete example: Disney took the tales collected by the Brothers Grimm and enforced copyright on its own interpretations. How fair is that?
The copyright laws of most of the developed world, including the US and the EU, are quite detailed, but some ambiguity is unavoidable. With just the right precedent and the right modifications, a work can go from being a blatant copy of somebody else’s work to being merely inspired by it.
So, is it just about the profit-making here? If profit were removed from the equation, would we be more willing to lump AI models in with humans and be done with it?
Well, that seems unlikely.
Big corporations can always take these hypothetical open-source LLMs and add their own spin, such as integration with their existing ecosystems or bonus features. There’s a reason you see Windows rather than Ubuntu even on low-end PCs, and a reason Internet Explorer grew so dominant despite better alternatives.
Marketing and existing customers are both resources that corporations can use even if the models are not their own creations.
Is there a solution?
We have to leave it to the policymakers. But it’s true that there needs to be more awareness, especially on the part of artists, writers, and everyone in between. AI’s capabilities have grown to the point that creators, from painters and musicians to game designers and video editors, feel threatened.
When an artist’s work is used without permission or compensation, it raises a number of ethical questions. Is a time coming when artists will not be needed at all? As much as we prize individual creativity and the human ability to create, mass layoffs are nobody’s fantasy; they are all too real, looming large, and not something corporate denial can fix.
Going forward, patents seem a better option for safeguarding creation, but they offer a shorter protection period (20 years, after which the invention enters the public domain). Effectively, patents block anybody else from extracting any kind of value from a creation, whereas copyright laws don’t. Under copyright law, for the most part, you can extract value (in a lossy way, say, as these LLMs do) and do whatever you want with it.
Comparing the human learning process to a model is simply incorrect. First, humans learn differently and often mix ideas in creative ways. Second, human speed is far too low for any practical use beyond, say, putting together your own horror movie after watching a pile of horror movies in your childhood. Third, no copyright law in the world makes it illegal to use a human brain to look at a work and be influenced by it, whereas strict copyright laws do apply to software and code, including algorithms.
If I had to sum it all up, I’d say that as long as the ethical, moral, and economic incentives in place don’t diminish the market value of a human being’s work, a model should be allowed to do whatever it wants, including make a profit. A human being’s work is subjective as well: a writer who spends years refining their craft and pouring their soul into a globally celebrated body of work, for example, should be worth more than a piece of code that is itself derivative.