How text and data drive the future of copyright in AI
This is a golden opportunity for India’s policy makers to act with foresight, emphasising the need for AI innovation and reinforcing India’s commitment to digital advancement

Generative AI has had a profound impact on our lives in recent years, owing to its remarkable capabilities not only in processing and generating text but also in creating and manipulating images, videos, and audio content. This transformative power, however, is inherently dependent on learning statistical patterns from vast amounts of data, much of which is sourced from the web. Proprietary datasets, licensed materials, and curated content also contribute to the training process. This reliance on large datasets raises significant concerns regarding intellectual property rights, as portions of the content used to train these models may be protected by copyright or other legal frameworks. As a result, efforts to advance AI innovation have frequently encountered barriers related to intellectual property, triggering major debates on the ethical and legal challenges of developing and refining large language models.
All major generative AI companies are currently involved in legal battles across the world. Many of these lawsuits are class actions filed by copyright owners of published content such as books, alleging systematic copyright infringement. In 2024, authors Andrea Bartz, Charles Graeber and Kirk Wallace Johnson filed a case against Anthropic, an AI company known for its Claude language models. The authors alleged that Anthropic trained its language models on millions of books, including their own, without permission or compensation.
In June 2025, the US District Court for the Northern District of California ruled that Anthropic’s use of legally acquired books to train its AI models fell under fair use, since the AI training process did not simply reproduce the original works but generated new content after learning statistical patterns. However, the judgement also noted that Anthropic’s act of building a central library from downloaded pirated copies could not be considered fair use. This raises the potential for considerable damages if the company is found liable in a forthcoming trial expected later in 2025. The court certified the case as a class action, which means that US authors whose works were allegedly used by Anthropic from pirated sources could be represented in the lawsuit. Such a class action could have very serious financial implications for Anthropic, as it is about to close a new $5 billion funding round that would peg the valuation of the company at an astronomical $170 billion.
India is no exception. The Delhi High Court is adjudicating a major copyright case filed by news agency ANI against US-based OpenAI. The High Court's decision is expected to provide clarity on several key concerns, in particular the extent to which the use of copyrighted material in training Large Language Models (LLMs) may be construed as infringement in the Indian legal context. While the US court's decision in the Anthropic case may influence the proceedings of the ANI lawsuit, it is important to note that, unlike US law, which relies on the principle of fair use, Indian law is based on the principle of fair dealing specified under Section 52 of the Copyright Act, 1957.
Indian law lists more than 30 scenarios in which the use of copyright-protected material does not amount to infringement, and the use of copyrighted content as training data for AI systems is not explicitly mentioned in any of these exceptions. However, India, being a common law country, gives courts sufficient leeway to interpret the law so as to do complete justice after taking into account the latest technological advancements and the interests of all stakeholders. This is a golden opportunity for India’s policy makers to act with foresight and vision, emphasising the need for innovation in the field of AI and reinforcing India’s commitment to digital advancement. This should involve a detailed public consultation phase that paves the way for an amendment to India’s copyright law: an exception that would enable the use of copyrighted works for computational data analysis and processing without prior approval from copyright owners.
Such Text and Data Mining (TDM) exceptions are not a new concept. They are already in force in many jurisdictions such as the European Union, the United Kingdom and Singapore, while others such as Hong Kong are on the verge of implementing them. The exception would allow AI systems to learn by extracting and analysing published text and data. It should cover both commercial and non-commercial organisations, so that public research institutions such as the IITs, IIMs and IISc as well as AI companies can derive the benefits.
At the same time, such an exception should not unreasonably prejudice the legitimate interests of copyright owners. Thus, it should not apply when AI companies use infringing copies of content or when licensing schemes are already available. Furthermore, organisations must maintain proper records of the copyrighted works used to train AI models. To safeguard the rights of copyright owners who want to expressly reserve their rights, an opt-out mechanism can also be provided.
A TDM exception will foster increased access to copyrighted works for training and developing LLMs. Enriching the datasets used to train these models with a large corpus of copyrighted content would eventually result in more powerful AI systems, particularly in a country like India with its multilingual landscape. It will also bring down transaction costs and uncertainty, as it eliminates the need to obtain prior approval from different copyright owners. Policy makers should craft this TDM exception as a tool to attract more companies and talent capable of investing in and engaging with the burgeoning Generative AI ecosystem in India. Such a step can go a long way in advancing India as a global hub for innovation in Generative AI.
*Authors are faculty members at IIM Calcutta
First Published: Sep 15, 2025, 17:06