
Unravelling the ethics of large language models: Navigating concerns in ChatGPT and beyond

Here's a look at the evolution of the Internet and the growth of online content, which gave rise to search engines and advances in Natural Language Processing, culminating in LLMs

Published: Jun 5, 2023 03:32:00 PM IST
Updated: Jun 6, 2023 09:30:23 AM IST

LLMs would be great tools to enhance the productivity of people in their daily work, but privacy concerns have recently been raised in Europe over the use of publicly available personal data to build LLMs. Image: Shutterstock

Large Language Models (LLMs) like ChatGPT by OpenAI or BARD by Google are artificial intelligence-based machine learning models trained on large amounts of text data to generate human-like language responses. Such LLMs would be great tools to enhance the productivity of people in their daily work. Recently, privacy concerns have been raised in Europe over the use of publicly available personal data to build LLMs. Other nations will likely encounter similar problems soon, which might lead to such applications being blocked. Concerns have also been raised about the legality of using available online documents to train LLMs and the possibility of monetising that information. In this article, we look at the evolution of the Internet and the growth of online content, which gave rise to search engines and advances in Natural Language Processing, culminating in LLMs. The article highlights the ethical issues around LLMs and suggests remedies to mitigate them.

The Internet revolution: Democratising information and the rise of large language models

The emergence of personal computers and the World Wide Web (WWW) in the early 1990s popularised the Internet. For the first time in the history of civilisation, information sharing and access were truly democratised, as the Internet provided a platform for people to create and share content of various types (text, video, audio, and so on) on a massive scale. This created the challenge of locating relevant information, with users initially relying on knowing the URLs of specific websites. Directory catalogues and hierarchy-based navigation with limited keyword search evolved as early approaches. It was only with the advent of modern search engines like Google that users could search for information quickly and efficiently across the internet's vast and rapidly growing landscape.


Empowering productivity: How large language models transform information retrieval

Today, people must read content from multiple sources to form an understanding of a particular concept. Reading can be tedious, and consuming enough content to grasp a profound concept takes time. LLMs, trained with machine learning on vast amounts of textual data from the Internet, offer a solution to this challenge: based on their understanding of a topic, they can automatically create a summary or respond to various related queries. Thus, LLMs can be a productivity-enhancing tool, saving hours of manual research by providing quick answers, with applications across multiple domains. LLMs can bring transformative changes to fields like legal consulting, healthcare, education, market research, and so on. The ethical concerns around LLMs relating to data access, privacy, and liability must be resolved for broader adoption and for LLMs to reach their transformative potential.

Navigating the legal landscape: Ethical considerations in training large language models

One of the primary concerns in the development of LLMs is the access and usage of data. A vast amount of content is available online, but does that imply the information is freely available for training LLMs, with the possibility of a third party monetising it? Different websites have their own terms of use for online content, and LLMs need to abide by these terms. Often, non-commercial use of the available content is permitted, but an LLM charging subscription fees would constitute commercial usage. Here, a useful precedent is the approach followed by search engines, which crawl the web and collect pages whose content is then indexed for users to search. Websites have a file named “robots.txt” in their home/root directory, instructing search engine robots and crawlers about which pages or sections of a website they can visit and index. The “robots.txt” file also includes directives for specific search engines or bots and instructions on the frequency and delay of crawls. It is a tool for website owners to control their site's visibility and accessibility to search engines and other automated agents. The protocol for using “robots.txt” was decided by consensus among the major search engines, website owners, and other related parties.
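As an illustration of how such machine-readable permissions already work in practice, here is a minimal sketch using Python's standard urllib.robotparser module to check whether a crawler may fetch a page; the crawler name and site URL are hypothetical:

    from urllib.robotparser import RobotFileParser

    # Hypothetical crawler identity and target site, for illustration only
    USER_AGENT = "ExampleLLMBot"
    SITE = "https://example.com"

    # Fetch and parse the site's robots.txt
    parser = RobotFileParser()
    parser.set_url(SITE + "/robots.txt")
    parser.read()

    # Check whether this crawler is allowed to fetch a given page
    page = SITE + "/articles/some-story.html"
    if parser.can_fetch(USER_AGENT, page):
        print("Allowed to crawl:", page)
    else:
        print("Disallowed by robots.txt:", page)

    # robots.txt can also request a delay between crawls
    print("Requested crawl delay:", parser.crawl_delay(USER_AGENT))

A well-behaved crawler runs a check like this before collecting any page, which is the behaviour the consensus protocol relies on.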

On similar lines, organisations developing LLMs, representatives of media firms publishing news articles, social media firms, and website owners may come together to form a consensus on the allowable use of content for training LLM models. All websites and content repositories could then have a file named “llms.txt” specifying which of their content can be used to train LLMs. Further, LLM developers should form exclusive revenue-sharing agreements with websites containing significant informative content, such as news portals, digital bookstores, online libraries, collections of scientific or academic publications, and so on. Australia's News Media Bargaining Code has enabled revenue-sharing agreements between media companies in Australia and social media/search companies like Google and Facebook.
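The “llms.txt” file described here is a proposal, not an existing standard. As a purely hypothetical sketch of what such a file might contain, modelled on robots.txt directives (all directive names below are invented for illustration), it could look like this:

    # Hypothetical llms.txt -- the directives below are invented for
    # illustration; no such standard currently exists.
    LLM-Agent: *
    Disallow-Training: /subscriber-only/
    Disallow-Training: /user-profiles/
    Allow-Training: /public-news/
    Licensing-Contact: licensing@example.com

As with robots.txt, compliance would ultimately depend on LLM crawlers honouring the file, backed by the consensus and revenue-sharing agreements described above.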

Safeguarding personal data: Privacy concerns and responsibilities in LLMs

Data privacy regulations, like the GDPR in Europe, protect individuals' privacy by providing guidelines for collecting, storing, and using personal data, along with penalties for violations. Respecting individual privacy and complying with these regulations is essential when collecting and using data for training LLMs. No data privacy regulation permits the processing of personal data without the consent of the person whose data is collected. Web scrapers can detect personal data, and LLM developers should avoid collecting or storing such data, or using it to train their models; any personal information already present in the training documents should be proactively deleted. As LLMs gain popularity through applications across sectors, government-backed regulators are likely to introduce audit requirements for LLMs to ensure there are no privacy breaches. It would be good practice for LLM developers to continuously scrutinise the data used for training and proactively remove any personal information stored in those documents.
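As a minimal sketch of what proactive removal of personal data from training documents might look like, assuming simple regex-based detection (real PII detection requires far broader coverage, including names, addresses, and identification numbers), consider the following Python snippet:

    import re

    # Illustrative patterns for two common kinds of personal data;
    # production PII detection needs much broader coverage than this.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

    def redact_personal_data(text: str) -> str:
        """Replace detected personal data with placeholders before training."""
        text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
        text = PHONE_RE.sub("[PHONE REDACTED]", text)
        return text

    sample = "Contact Jane at jane.doe@example.com or +91 98765 43210."
    print(redact_personal_data(sample))
    # Contact Jane at [EMAIL REDACTED] or [PHONE REDACTED].

Running a pass like this over the training corpus, and re-running it as new data is collected, is one way to operationalise the continuous scrutiny suggested above.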


Seeking truth and transparency: Addressing bias, incompleteness, and liability in LLMs

Another significant concern with LLMs is the accuracy and accountability of the responses generated by the machine. This applies especially to subjective topics and to situations where the LLM lacked sufficient information during the training phase to understand the topic well. People might challenge the validity of responses on such subjective or incompletely covered topics. Developing fair, bias-free applications and adequately explaining the machine's decisions are the prime ingredients of a trustworthy and responsible AI system like an LLM. A subject-matter expert on each topic of interest could validate the completeness of the information on that topic, but bringing human experts into the loop would contradict the basic premise of machines learning from massive volumes of data. A simpler approach is for the LLM to explicitly provide references to the articles or websites on which its responses are based. Some LLMs now offer references upon request; this should be embedded in the ecosystem by default, as including references in responses provides transparency, attribution, and protection against copyright violations.
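As a minimal sketch of how references might be embedded in a response pipeline, the following Python snippet attaches source URLs to each answer; the document store, word-overlap scoring, and answer synthesis are all illustrative stand-ins for a real retrieval-and-generation system:

    # Minimal sketch: answer a query from a small document store and
    # return the sources used, so every response carries its references.
    DOCUMENTS = [
        {"url": "https://example.com/a", "text": "LLMs are trained on large text corpora."},
        {"url": "https://example.com/b", "text": "robots.txt tells crawlers which pages to index."},
    ]

    def retrieve(query: str, k: int = 2):
        """Rank documents by naive word overlap with the query."""
        words = set(query.lower().split())
        scored = sorted(
            DOCUMENTS,
            key=lambda d: len(words & set(d["text"].lower().split())),
            reverse=True,
        )
        return scored[:k]

    def answer_with_references(query: str) -> dict:
        sources = retrieve(query)
        # A real system would generate the answer with an LLM conditioned
        # on these sources; here we simply echo the top passage.
        return {"answer": sources[0]["text"],
                "references": [s["url"] for s in sources]}

    print(answer_with_references("How are LLMs trained?"))

Because the references travel with the answer rather than being produced on request, transparency and attribution become properties of the system rather than of the prompt.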

(views expressed are personal)

Avik Sarkar, Institute of Data Science, Indian School of Business (ISB)

[This article has been reproduced with permission from the Indian School of Business, India]
