Mitesh Khapra's AI4Bharat is closing the AI gap between English and Indian languages

Mitesh M Khapra's AI4Bharat has led a groundbreaking effort to lay the digital foundations so that AI can be built for Indian languages

Published: 16/06/2025 02:15 PM

Mitesh M Khapra, Founder, AI4Bharat, and Associate Professor, IIT-Madras. Image: Mexy Xavier


When he co-founded AI4Bharat in 2020, “our mission was very simple, that for Indian languages, AI technologies should be at par with English”, recalls Mitesh M Khapra. “That’s easy to state, but quite difficult to achieve.”


When AI4Bharat started as a research lab at IIT-Madras, where Khapra is an associate professor, the biggest challenge was the absence of adequate volumes of data. Khapra and fellow researchers started small, painstakingly creating datasets and models for a handful of languages. They also collaborated with companies such as Microsoft.


The effort got a shot in the arm when AI4Bharat was invited to become the data management unit for India’s Bhashini Mission, backed by the Ministry of Electronics and IT, which aims to make all digital services accessible in every Indian language. Today, AI4Bharat is responsible for about 80 percent of the data powering Bhashini.


When AI4Bharat started, LLMs had not arrived on the scene yet, and the research lab’s “focus was on four verticals: Machine translation, speech recognition, speech synthesis, and optical character recognition or document parsing,” Khapra says. “So, we were given the task of collecting data for all of these so that models can be trained.”


In addition to Bhashini, a ₹36 crore grant from the Nilekani Philanthropies in 2022 also helped AI4Bharat double down on its work. Pratyush Kumar, now a co-founder of Sarvam AI, was AI4Bharat’s other co-founder, and Anoop Kunchukuttan, a researcher at Microsoft Research, was an external collaborator. And Vivek Raghavan, also a co-founder of Sarvam—and chief AI evangelist at the Nilekanis’ EkStep Foundation—remains a mentor to AI4Bharat.



One flagship project involves collecting natural speech data from diverse demographics across the country—farmers, students, homemakers—capturing the true diversity of Indian voices and dialects. The project covers 22 Indian languages, collecting voice samples from 400 districts, and Khapra expects the bulk of the work to be completed by September.


Another milestone: A state-of-the-art studio at IIT-Madras where professional voice actors record high-quality samples. “As we speak, there’s a Kashmiri voice artist, a Gujarati voice artist and a Maithili voice artist,” he says. “Then we’ll develop what are known as speech synthesis models.”


“The past three to four years have been about building this ramp, which can lay the foundation for Indian language technology,” Khapra says. “And everything has been done in the open source, both datasets and models, because we are keen that the ecosystem should also use it and sort of build on top of it.”


“If you can create high quality data, like AI4Bharat is, in Indian languages and share it with everyone, it will accelerate by years the pace at which things get rolled out,” Nandan Nilekani, former chairman of India’s Unique ID Authority, said in an interview with The Economic Times published on March 10. “The hard part is uses. How do we use this to deliver value for people? That is where we can lead,” Nilekani added. 




Work at AI4Bharat has reached a stage where it will create a snowball effect, Khapra says. “We have reached a stage where our data and models can generate more data. Now, you can see the snowball effect that things will start sort of quickly, build up,” he says. “And that sets us up for making an attempt at building a sovereign LLM for Indian languages.” And indeed, AI4Bharat is collaborating closely with Sarvam AI to build India’s first sovereign AI model from scratch.


Thus far, AI4Bharat has created India’s largest open-source language resource, including datasets for machine translation, speech recognition, speech synthesis, and document OCR. Among them, Samanantar is the largest publicly available parallel corpus for Indic languages, with 49.7 million sentence pairs between English and 11 Indic languages. This corpus has been instrumental in training multilingual ‘neural machine translation’ (NMT) models that outperform existing benchmarks, according to the lab’s website.


IndicTrans is a transformer-based multilingual NMT model trained on Samanantar, and IndicTrans2 is the first open-source model supporting high-quality translations across all 22 scheduled Indic languages.


“Three to four years ago, we were at least a decade behind where English technology was, and we’ve been trying to catch up and reach a respectable point,” Khapra says. “Of course, English has also been running ahead, but because of the groundwork we’ve laid, we see a path to be able to close the gap.”


Even as LLMs have been built for English, “now we think that in another six to nine months, we should be able to close that gap”, he adds.


One other ambitious project Khapra hopes to expand is a ‘Bharat Cultural Repository’, a multimodal digital archive for every village in India, capturing local recipes, festivals, monuments, and oral histories in text, audio, and video. If it succeeds, the effort will be not only a priceless work of preserving India’s cultural diversity and heritage, but also a massive database that can be tapped for any number of practical applications.


Last Updated: 16/06/2025 02:39 PM IST