Mitesh Khapra's AI4Bharat is closing the AI gap between English and Indian languages

Mitesh M Khapra's AI4Bharat has led a groundbreaking effort to lay the digital foundations so that AI can be built for Indian languages

Published: 16/06/2025 02:15 PM

Mitesh M Khapra, Founder, AI4Bharat, and Associate Professor, IIT-Madras. Image: Mexy Xavier


When he co-founded AI4Bharat in 2020, “our mission was very simple, that for Indian languages, AI technologies should be at par with English”, recalls Mitesh M Khapra. “That’s easy to state, but quite difficult to achieve.”


When AI4Bharat started as a research lab at IIT-Madras, where Khapra is an associate professor, the biggest challenge was the absence of adequate volumes of data. Khapra and fellow researchers started small, painstakingly creating datasets and models for a handful of languages. They also collaborated with companies such as Microsoft.


The effort got a shot in the arm when AI4Bharat was invited to become the data management unit for India’s Bhashini Mission, backed by the Ministry of Electronics and IT, which aims to make all digital services accessible in every Indian language. Today, AI4Bharat is responsible for about 80 percent of the data powering Bhashini.


When AI4Bharat started, LLMs had not arrived on the scene yet, and the research lab’s “focus was on four verticals: Machine translation, speech recognition, speech synthesis, and optical character recognition or document parsing,” Khapra says. “So, we were given the task of collecting data for all of these so that models can be trained.”


In addition to Bhashini, a ₹36 crore grant from the Nilekani Philanthropies in 2022 also helped AI4Bharat double down on its work. Pratyush Kumar, now a co-founder of Sarvam AI, was AI4Bharat’s other co-founder, and Anoop Kunchukuttan, a researcher at Microsoft Research, was an external collaborator. And Vivek Raghavan, also a co-founder of Sarvam—and chief AI evangelist at the Nilekanis’ EkStep Foundation—remains a mentor to AI4Bharat.



One flagship project involves collecting natural speech data from diverse demographics across the country—farmers, students, homemakers—capturing the true diversity of Indian voices and dialects. The project covers 22 Indian languages, collecting voice samples from 400 districts, and Khapra expects the bulk of the work to be completed by September.


Another milestone: A state-of-the-art studio at IIT-Madras where professional voice actors record high-quality samples. “As we speak, there’s a Kashmiri voice artist, a Gujarati voice artist and a Maithili voice artist,” he says. “Then we’ll develop what are known as speech synthesis models.”


“The past three to four years have been about building this ramp, which can lay the foundation for Indian language technology,” Khapra says. “And everything has been done in the open source, both datasets and models, because we are keen that the ecosystem should also use it and sort of build on top of it.”


“If you can create high quality data, like AI4Bharat is, in Indian languages and share it with everyone, it will accelerate by years the pace at which things get rolled out,” Nandan Nilekani, former chairman of India’s Unique ID Authority, said in an interview with The Economic Times published on March 10. “The hard part is uses. How do we use this to deliver value for people? That is where we can lead,” Nilekani added. 




Work at AI4Bharat has reached a stage where it will create a snowball effect, Khapra says. “We have reached a stage where our data and models can generate more data. Now, you can see the snowball effect that things will start sort of quickly, build up,” he says. “And that sets us up for making an attempt at building a sovereign LLM for Indian languages.” And indeed, AI4Bharat is collaborating closely with Sarvam AI to build India’s first sovereign AI model from scratch.


Thus far, AI4Bharat has created India’s largest open-source language resource, including datasets for machine translation, speech recognition, speech synthesis, and document OCR. Among them, Samanantar is the largest publicly available parallel corpus for Indic languages, with 49.7 million sentence pairs between English and 11 Indic languages. This corpus has been instrumental in training multilingual ‘neural machine translation’ (NMT) models that outperform existing benchmarks, according to the lab’s website.


IndicTrans is a transformer-based multilingual NMT model trained on Samanantar, and IndicTrans2 is the first open-source model supporting high-quality translations across all 22 scheduled Indic languages.


“Three to four years ago, we were at least a decade behind where English technology was, and we’ve been trying to catch up and reach a respectable point,” Khapra says. “Of course, English has also been running ahead, but because of the groundwork we’ve laid, we see a path to be able to close the gap.”


Even as LLMs have been built for English, “now we think that in another six to nine months, we should be able to close that gap”, he adds.


One other ambitious project Khapra hopes to expand is a ‘Bharat Cultural Repository’, a multimodal digital archive for every village in India, capturing local recipes, festivals, monuments, and oral histories in text, audio, and video. If it succeeds, the effort will be not only a priceless work of preserving India’s cultural diversity and heritage, but also a massive database that can be tapped for any number of practical applications.


Last Updated: 16/06/2025 02:39 PM IST