Multi-Agent AI: Article Writing

A multi-agent GenAI app built on the LangGraph framework, with evaluation via LangSmith.

  • Agents: 1. Manager (LLM: gpt-4.1) and 2. Researcher (LLM: gpt-4.1)
  • Tools: 1. Tavily Search API and 2. Today (custom)

See also the multi-agent, multi-modal, multi-model General AI Assistant.
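The Manager → Researcher hand-off and the custom Today tool can be sketched in plain Python. This is a self-contained illustration only: the actual app wires these up as LangGraph nodes with gpt-4.1 and the Tavily Search API, which are stubbed out here, and every function name below is an assumption rather than the app's real interface.

```python
from datetime import datetime, timezone

def today_tool() -> str:
    """Hypothetical 'Today' custom tool: returns today's date in ISO format (UTC)."""
    return datetime.now(timezone.utc).date().isoformat()

def researcher(topic: str) -> str:
    """Stand-in for the Researcher agent; the real app calls Tavily Search and gpt-4.1."""
    return f"[research notes on: {topic}]"

def manager(topic: str) -> str:
    """Stand-in for the Manager agent: delegates research, then assembles the draft."""
    notes = researcher(topic)   # delegate to the Researcher
    dateline = today_tool()     # date the article via the custom tool
    return f"Date: {dateline}\nTopic: {topic}\n{notes}"

print(manager("LLMs and fair use"))
```

In the real LangGraph app, `manager` and `researcher` would be graph nodes sharing state, with tool calls routed through the framework's tool-calling interface rather than direct function calls.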

Large Language Models and the Fair Use Doctrine

Date: 2025-05-11
Author: Multi-Agent AI System


Introduction

The rapid advancement of artificial intelligence, particularly Large Language Models (LLMs) such as OpenAI’s GPT-4, has fundamentally transformed how we interact with information, create content, and even approach legal and ethical questions. As LLMs require vast quantities of text data to train effectively, the use of copyrighted materials in this process has become a focal point of legal debate. Central to this discourse is the “Fair Use Doctrine,” a legal concept in United States copyright law that allows limited use of copyrighted material without the permission of the rights holders under specific circumstances. However, the application of fair use to LLMs is far from straightforward and is currently the subject of intense industry scrutiny, legal battles, and policy discussions.

This article explores the intersection between large language models and the fair use doctrine, offering an in-depth analysis of the legal, ethical, and practical issues at play.


What is the Fair Use Doctrine?

The Fair Use Doctrine is a legal provision under U.S. copyright law (17 U.S.C. § 107) that permits limited use of copyrighted material without requiring permission from the copyright owner. The doctrine is intended to balance the interests of copyright holders with the public interest in the broader dissemination of knowledge and culture.

The statute sets out four factors to be considered in determining whether a particular use is “fair”:

  1. Purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
  2. Nature of the copyrighted work.
  3. Amount and substantiality of the portion used in relation to the copyrighted work as a whole.
  4. Effect of the use upon the potential market for or value of the copyrighted work.

Courts weigh these factors on a case-by-case basis, with no single factor being determinative. Notably, “transformative” uses—those that add something new, with a further purpose or different character—are more likely to be considered fair use[4].


Large Language Models: How They Use Data

LLMs are trained on enormous datasets, often composed of publicly available text from books, articles, websites, and other media. This data is used to teach the model linguistic patterns, factual knowledge, and reasoning abilities. The challenge arises because these datasets inevitably contain a significant amount of material that is protected by copyright law[2].

The stakes are high: Without access to broad swathes of language data, LLMs cannot achieve their remarkable fluency and accuracy. Yet, using copyrighted material, especially for commercial products, potentially exposes AI developers to claims of infringement.


The Fair Use Debate: Training vs. Output

1. Training Phase

Much of the legal debate centers on whether the ingestion of copyrighted works to train LLMs constitutes fair use. Proponents argue that the process is transformative; the model does not reproduce the works but rather extracts statistical patterns and linguistic structures to generate new content[4][6]. Precedent cases, such as Authors Guild v. Google, Inc., have supported the notion that digitizing copyrighted books for non-expressive, informational purposes can constitute fair use.

On the other hand, critics argue that the sheer scale and commercial nature of LLMs’ training sets distinguish them from previous cases. Rights holders contend that training LLMs can harm the market for the original works, especially if the models can generate outputs that substitute for the originals[5].

2. Output Phase

A distinct issue arises when LLMs generate text that is substantially similar to, or even a verbatim copy of, copyrighted material in their training data. While such verbatim reproduction is rare, cases have been reported in which models “memorize” and reproduce entire passages or articles[6]. Courts may be less inclined to find fair use in these scenarios, especially if the output competes directly with the original work or diminishes its market value[5].


Recent Legal Developments

The New York Times v. OpenAI

One of the most prominent lawsuits is The New York Times’ 2023 action against OpenAI and Microsoft, alleging copyright infringement for using Times’ articles in LLM training and for generating outputs that reproduce or closely mimic Times’ content[4][7]. The outcome of this case is expected to set a significant precedent for how courts interpret fair use in the context of AI.

Other Notable Cases

  • Thomson Reuters v. Ross Intelligence: Here, the court considered the “single most important” factor to be the effect on the potential market. If LLMs’ outputs directly substitute for the original work, fair use is less likely to apply[5].
  • Multiple class-action lawsuits have been filed by authors, artists, and rights holders, challenging both the training and output mechanisms of generative AI models.

Key Legal Considerations

1. Transformative Use

  • Courts have historically favored uses that are transformative—adding new meaning, purpose, or function to the original material.
  • LLM training is arguably transformative, as it converts expressive works into statistical representations for the creation of new, non-identical outputs[4].

2. Market Harm

  • If LLM-generated content competes in the same market as the original work (e.g., news articles, books), this weighs against a finding of fair use[5][6].

3. Commercial vs. Noncommercial Use

  • Commercial uses are scrutinized more closely, especially when companies charge for access to their models or services[2].

4. Memorization and Regurgitation

  • If LLMs reproduce substantial portions of copyrighted text verbatim, the argument for fair use becomes much weaker, as such output is not transformative[6].

Ethical and Practical Challenges

  • Transparency and Data Provenance: LLMs generally lack the ability to trace outputs back to specific sources, complicating legal analysis and risk management[3].
  • Jurisdictional Variability: Fair use is a uniquely U.S. doctrine; other countries have different legal frameworks with varying degrees of flexibility[1].
  • Industry Risk: Ongoing legal uncertainty poses risks to AI companies, investors, and users, potentially stifling innovation or leading to overly cautious approaches in model development[3].

Policy Discussions and Future Directions

There is a growing call for legislative or regulatory clarification regarding AI training and copyright. Proposals include:

  • Creating explicit exemptions for AI training under copyright law.
  • Requiring AI companies to license content or compensate rights holders.
  • Developing technical solutions for data tracking and attribution[3].

The U.S. Copyright Office and international bodies are actively soliciting public input and considering policy changes to address these challenges[4].


Conclusion

The intersection of large language models and the fair use doctrine represents one of the most complex and consequential legal questions of the AI era. While there is persuasive precedent for treating data ingestion as transformative and potentially fair use, unresolved issues around market harm, memorization, and output similarity mean that each case may be decided on its specific facts. As courts and policymakers grapple with these questions, the future of AI development and copyright law remains uncertain—and vitally important for all stakeholders.


References

  1. The Copyright Conundrum: Fair Use, LLMs, and the Global Legal Maze (Medium)
  2. Beyond Fair Use: Legal Risk Evaluation for Training LLMs on Copyrighted Text (GenLaw)
  3. What AI Executives Need to Know About Fair Use Law (VKTR)
  4. Training Generative AI Models on Copyrighted Works Is Fair Use (Association of Research Libraries)
  5. AI’s First Copyright Fair Use Ruling: Thomson Reuters v. Ross (Hunton)
  6. RAND: Decoding US Copyright Law and Fair Use for Generative AI Legal Cases
  7. Artificial Intelligence and Copyright Law: The NYT v. OpenAI Fair Use Implications (Baker Donelson)