The Indian Institute of Science (IISc), ARTPARK (AI and Robotics Technology Park) and Google have come together for an unusual digital project – mapping the language diversity of India by collecting speech samples from about a million people across 773 districts over three years. Project Vaani, as it is called, intends to record over 150,000 hours of speech, part of which will be transcribed in local scripts.
“Over the past few months, we have gathered data from nearly 69 districts across India. So far, we have collected nearly 150 hours of data covering more than 30 languages from 841 different pincodes in a gender and age-balanced manner; this also includes some of the endangered tribal languages, such as Heidi,” said Govindan Rangarajan, Director of IISc, during the project launch in Delhi.
“For Project Vaani, we decided to opt for a district-anchored approach, wherein we will go to every district and record whatever speech is spoken locally,” explained Prasanta Kumar Ghosh, IISc.
Ghosh said the team randomly selects around 1,000 people from each district, shows them some photos from their area, and asks them to describe the photos in a language they speak locally or at home.
“It is a unique initiative, and the data produced by it will be one of a kind. The support from Google is crucial for meeting Vaani’s goal of scaling the project pan India,” said Rangarajan. Besides funding the project, Google is providing resources such as Cloud. According to Ghosh, the project cost is in the range of $30–40 million.
Objective of the project
The goal of the project is to boost the development of technologies such as automatic speech recognition, speech-to-speech translation and natural language understanding. Because dialects in India can change from one district to the next, the project takes a district-level approach.
“The greater objective is to create a technological solution that will eliminate the linguistic barriers that currently exist in technology and increase accessibility for a wider range of people,” said Raghu Dharmaraju, president, ARTPARK.
Language AI is fundamentally important, as much of what is available on the internet is in English. “To build a more comprehensive lexicon, we’re attempting to capture more diversity and natural speech,” said Dharmaraju.