For English- and Hindi-speaking gadget users, tools like Siri, Alexa, and the Google Assistant come in handy. From pulling up phone numbers and making calls to getting answers to search queries, these artificial intelligence-based solutions have emerged as virtual secretaries.

The same is not true for users of regional languages. The basic problem is the scarcity of voice data samples in the databases. “Having datasets containing speech is crucial. In order to address this challenge, we have launched a pilot with a set of users,” an executive of the International Institute of Information Technology, Hyderabad (IIIT-H) said.

The team will work with academic institutions in Andhra Pradesh and Telangana to collect at least 2,000 hours of spoken Telugu over the next year.

The team will also work with the Telugu Wikipedia community and industry partners such as OzoneTel and Pactera Edge to leverage data from their networks.

The Technology Development for Indian Languages (TDIL) initiative of the Union Ministry of Electronics and Information Technology, informally known as ‘Bahu Bhashik’, is working on overcoming language barriers to facilitate the proliferation of information and communication technologies in all languages.

“This involves automatic speech recognition, speech-to-speech translation and speech-to-text translation,” the executive said.

IIIT-H is working with the government to develop an Automatic Speech Recognition (ASR) module for the translation of Indian languages.

The team is headed by Prakash Yalla, Head (Technology Transfer Office), and Anil Kumar Vuppala, Associate Professor at the Speech Processing Centre.

“To build AI-enabled automatic speech recognition systems, we need thousands and thousands of hours of speech data, along with transcribed text of the same for each language,” the executive said.

“In our lab, we have been working on speech recognition technology for the last 10 years and have collected data too. But it is of the order of 50-60 hours. Now, we need thousands of hours of data,” Anil Vuppala said.

“The main challenge here is not limited to the audio or speech file alone. The important thing is fragmenting the speech files, and writing them down in the form of text. It’s a very laborious process,” he said on the institute’s blog.

The initial collection of Telugu speech data is expected to lead to the establishment of protocols and systems for crowdsourcing data for all Indian languages.

“If everything works, it’ll become a nation-wide data collection exercise, probably the largest ever and we’ll make it available to the general public free of cost,” Prakash Yalla said.