What does it take to make a computing machine understand a language? It involves making it recognise pronunciations in a particular language, and making it understand how sentences are formed.

Making computers understand languages holds the key to developing solutions that let non-English speaking Indians make full use of computing and mobile applications.

This requires a good set of ‘voice models’ of a language that capture all its nuances and idiosyncrasies.

Microsoft, which has been collecting voice models for the last few years, has decided to make its Indian language Speech Corpus available to companies and research institutes. This will allow them to work on voice and language models that ultimately would facilitate inter-language conversations and transactions.

Microsoft’s open data corpus, available presently in Tamil, Gujarati and Telugu, would help researchers and academia build Indian language speech recognition applications.

Sundar Srinivasan, General Manager (Artificial Intelligence and Research), Microsoft India, claims this is the largest publicly available Indian language speech dataset.

“There are two components — understanding phonetics and understanding the language and transcribing it in a written form,” he said.

According to him, there is a scarcity of digital data for text, speech and linguistic resources, which hold the key to developing machine learning models for local languages. Microsoft has worked with its partners in building the digital corpus, giving representative references to distinct differences in accent, slang and diction.

“The corpus of data that we are making available would go a long way in addressing this challenge,” he said.

He did not, however, spell out the timeline for introducing the digital data corpus for other Indian languages.

The corpus was tested at the just-concluded Interspeech 2018, held here, in a Low Resource Speech Recognition Challenge, where a few participants used data from Microsoft’s newly launched corpus to build Automatic Speech Recognition (ASR) systems.

In parallel, a Union Government-led consortium has been working for over a decade to enable inter-language translations.

The initiative, titled Sampark, is aimed at facilitating translations by machines in Indian languages. The consortium includes the International Institute of Information Technology (Hyderabad), the University of Hyderabad, C-DAC, Anna University KBC Chennai, and a few IITs and IIITs.
