AI's Use of Copyrighted Material: Fair Use or Copyright Infringement?
The use of copyrighted material to train artificial intelligence (AI) systems has become a contentious legal question, and court rulings so far have been mixed. Generally, decisions have leaned toward treating such training as fair use when it is transformative and performed on legally obtained works, a result that benefits AI companies and the public by enabling innovation. Creators and rights holders, however, continue to raise concerns about unlicensed copying and compensation.
In recent U.S. court cases, including Kadrey v. Meta, the use of copyrighted books for AI training has been deemed fair use because of its transformative nature: the model learns statistical patterns rather than reproducing works verbatim, enabling it to generate new, distinct content. A similar ruling was issued in Bartz v. Anthropic, although litigation over other issues in that case persists.
Creators and rights holders counter that training AI on their works without permission is unlicensed copying, even when the final AI outputs do not replicate the originals, and that such use should require authorization or payment. Artists, authors, and publishers have filed lawsuits alleging copyright infringement in training practices that rely on material scraped from the web or otherwise used without consent.
For AI companies and the public, these rulings allow continued training on vast datasets under fair use, which proponents argue is essential for innovation, creativity, and the development of useful tools and content-generation systems. The evolving legal landscape nonetheless carries real risk and underscores the need for clearer rules that balance fair use against the protection of creators’ rights.
Early efforts to address this complex issue include opt-out initiatives such as "Do Not Train" metadata tags and registries where creators can declare their preferences (a minimal illustration follows below). Litigation, meanwhile, continues on several fronts: the Authors Guild has sued OpenAI over the use of books without permission to train ChatGPT, and Getty Images has sued Stability AI over the alleged unlicensed use of more than 12 million of its images.
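To make the opt-out idea concrete, here is a minimal sketch of how a crawler building a training corpus might honor a "Do Not Train"-style declaration. It assumes the emerging, voluntary `noai`/`noimageai` robots-meta convention; the directive names and the exclusion policy are illustrative, not a settled standard.

```python
# A minimal sketch, not a standard: filter pages out of a training corpus
# when they carry a "noai"-style robots meta directive. The directive names
# ("noai", "noimageai") follow an emerging, voluntary convention and are
# illustrative; a real pipeline would also need to consult registries and
# license terms.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots" and attr_map.get("content"):
            self.directives.update(
                d.strip().lower() for d in attr_map["content"].split(",")
            )


def allows_ai_training(html: str) -> bool:
    """Return False when the page opts out via a noai-style directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noai", "noimageai"} & parser.directives)


# Usage: a page carrying the opt-out tag is excluded from the corpus.
page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(allows_ai_training(page))  # -> False
```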
In the United States, Tennessee passed the Ensuring Likeness Voice and Image Security (ELVIS) Act to protect individuals' voices and likenesses from unauthorized use in AI-generated content. Meanwhile, several major lawsuits are testing whether training AI systems on copyrighted material, and the content those systems generate, qualify as fair use. For example, Disney, Universal, and Warner Bros. filed a joint lawsuit against Fable Simulation over the use of copyrighted scripts, stills, and character designs to train an AI tool that generates short animated videos.
Courts evaluate fair use through a four-factor framework, and each factor raises distinct, unresolved questions when applied to generative AI. The first factor, the purpose and character of the use, is often the most contested: AI companies argue that training qualifies because it extracts patterns to build new systems rather than reproducing or distributing the original work, while critics counter that copying works into training datasets without permission or payment is unlicensed copying regardless of the output.
The second factor, the nature of the copyrighted work, tends to favor rights holders when the training data consists of highly creative works such as novels or artwork rather than factual material. The third factor, the amount and substantiality of the portion used, typically weighs against fair use when AI models ingest entire works rather than excerpts. The fourth factor, the effect of the use on the market, asks whether the new use harms the market for the original work or its potential derivatives; creators argue that AI-generated content could substitute for their work, especially when it imitates a distinctive style or is used commercially.
Even where training itself is fair use, obtaining copyrighted material from pirate websites can still constitute infringement. In the Anthropic case, for instance, a federal judge ruled that training the Claude model on lawfully acquired books qualified as fair use, but held that the company's downloading of millions of books from pirate sites to build a permanent library was not protected.
The EU's AI Act imposes transparency obligations on AI providers, including requirements to document training data and publish summaries of the copyrighted material used. Rules like these could compel companies to disclose what data they use, where they source it, and whether they obtained permission. Some companies sidestep the dispute altogether: Adobe, for instance, trains its Firefly model only on licensed or owned content.
President Donald Trump has publicly rejected proposals that would require AI companies to compensate creators for training on their work. The legal environment remains complex and unsettled, affecting stakeholders differently depending on jurisdiction and case specifics, and it continues to generate debate and litigation.
- AI companies argue that extracting patterns from copyrighted works during training is transformative and drives innovation and technological progress; some, like Adobe with its Firefly model, avoid the dispute entirely by training only on licensed or owned content.
- Creative professionals counter that unauthorized use of their works in AI training, even when the generated content does not replicate the originals, is unlicensed copying that should require authorization or payment to protect their rights.