Google — all ears and voice

Swetha Kannan | Updated April 24, 2011 at 07:22 PM

Mike Cohen

As the research scientist who leads the speech technology efforts at Google, Mike Cohen has a huge task at hand. Speech technology essentially involves speech recognition and synthesis: it recognises various languages and accents and turns what is being spoken into text, or vice versa. The technology has been built into many of Google's products — be it voice search on smartphones, Voice Actions, which carries out simple tasks such as making calls and sending text messages on voice commands, or YouTube, where the audio track of videos gets transcribed into captions that can be translated into other languages.

Google's long-term vision is to make speech recognition and synthesis completely and ubiquitously available on every application, in every usage scenario, across devices. Obviously a lot to chew on... But Cohen, who has been working on speech technologies at Google since 2004, believes Google is on the right track as it accumulates huge amounts of voice data to perfect its models. So what exactly do Cohen and his team do? eWorld talks, over a video-conference, to the man who gave Google its ears and voice.

How has speech technology been built into Google products?

Our research is done in the context of solving real problems and improving products, instead of being siloed in the laboratory. The first instance is voice search on smartphones such as Android devices and the iPhone. Instead of typing into the Google search bar, you can speak your query and the search results are returned. Voice search on the mobile device today supports 25 spoken languages; around 20 of them were added in the past year.

Other products in the market include ‘Voice Actions' on an Android phone. Using this, you can actually say ‘Send a text message to Steve Taylor: Please come to the club', and the system will find Steve's number, create the message and send it to him. You could also say ‘Call Steve' and the call is made.

Another service on Android is called ‘voice input'. Whenever the keypad pops up on the Android screen, there is also a microphone button. Hit that button and, in any application from the app store where you would otherwise have to type something, you can speak it instead. The application developer may not even know his application is voice-enabled; he does not have to do anything special for it. He may not even know that some of the input is coming in through voice; he just gets the text stream (the voice is converted to text). For example, I downloaded an app called ‘Myfitness' for tracking what I have eaten and the number of calories consumed. Instead of typing out what I had for lunch, I can say ‘chicken salad sandwich', that gets entered as text, and my calorie count is computed.
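A rough way to picture this design is the hypothetical Python sketch below (the function names are illustrative placeholders, not Google's API): the speech recogniser lives at the input-method layer, so the application only ever sees the resulting text and needs no changes to become voice-enabled.

```python
# Hypothetical sketch (not Google's code): voice input sits at the
# input-method layer, so the application only ever receives plain text,
# whatever its origin.

def recognise_speech(audio_bytes: bytes) -> str:
    """Stand-in for the platform's speech recogniser; returns a transcript."""
    return "chicken salad sandwich"  # placeholder result

def on_text_entered(text: str) -> None:
    """The app's ordinary text handler; it never knows voice was involved."""
    print(f"Logging meal: {text}")

# Keyboard path: the user types, and the handler gets a string.
on_text_entered("chicken salad sandwich")

# Voice path: the input method converts audio to text first, then calls the
# very same handler, so the app does nothing special to be voice-enabled.
on_text_entered(recognise_speech(b"<raw microphone audio>"))
```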

We have also integrated voice technology into YouTube for English videos. For every English YouTube video that gets uploaded, the audio track is sent through the speech recogniser, which transcribes it and creates a caption track, so that you can read what is being said. It is useful for the hearing impaired. We also connect it to Google's machine translation technology, so that you can take that caption track and translate it into any language. For non-English speakers watching an English video, this is useful. We are thus breaking the language barrier on YouTube.
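A minimal sketch of the pipeline described above, with assumed function names standing in for the real YouTube and Google Translate systems: the audio track is transcribed into a timed caption track, and each caption line can then be machine-translated into a target language.

```python
# Illustrative sketch only; transcribe() and translate() are stand-ins, not
# real APIs. The structure mirrors the described pipeline: speech recognition
# produces timed captions, machine translation turns them into subtitles.

from typing import List, Tuple

Caption = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

def transcribe(audio_track: bytes) -> List[Caption]:
    """Stand-in speech recogniser producing a timed caption track."""
    return [(0.0, 2.5, "welcome to the channel"),
            (2.5, 5.0, "today we talk about speech recognition")]

def translate(text: str, target_lang: str) -> str:
    """Stand-in machine translation step."""
    return f"[{target_lang}] {text}"

def captions_for(audio_track: bytes, target_lang: str) -> List[Caption]:
    # Chain the two stages: recognise first, then translate each caption line.
    return [(start, end, translate(text, target_lang))
            for start, end, text in transcribe(audio_track)]

print(captions_for(b"<audio>", "hi"))  # e.g. Hindi subtitles for an English video
```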

Today, we can transcribe voice mail if you feel like reading it instead of listening to it.

What's the underlying technology behind all this?

Speech recognition today is basically a data-driven process. To build a speech recognition system for a particular language, we have to feed the system lots and lots of real data — people speaking that language. From that data, the system learns a statistical model of the language.

Rather than somebody explicitly programming how an ‘a' sounds, the contexts in which an ‘a' can occur, which words exist and with what probability, which words follow which, what the grammar is and so on, we build a statistical model, feed it lots and lots of data, and the system automatically learns the structure of the language.

The statistical model has three pieces to it. The first piece covers all the fundamental sounds of the language and their frequencies: the ‘aah', ‘boo' and ‘moo' sounds. These are the basic phonetic units that make up the language, and how they sound depends on the context, on what comes before and after. Then we have the ‘word pronunciation' piece: we have fed the system a lexicon, a list of millions of words, which defines how each word is pronounced. The final piece is the grammatical piece, which is about which words follow other words. All of these are learnt by feeding lots and lots of data into the system. The system then learns more and improves its performance. As more people use our products, we get more data, and we can make the model better and richer and add more features and parameters.
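To make the three pieces concrete, here is a toy Python sketch of how they fit together at recognition time: the decoder looks for the word sequence with the best combined acoustic and language-model score, with the pronunciation lexicon linking words to their phone sequences. All of the names, scores and probabilities below are made up for illustration; they are not Google's models.

```python
# Toy illustration (not Google's system) of the three learned pieces combining
# at decoding time: pick the words W that maximise P(audio | W) * P(W).

import math

# Pronunciation lexicon: word -> phone sequence.
lexicon = {
    "ice cream": ["ay", "s", "k", "r", "iy", "m"],
    "i scream":  ["ay", "s", "k", "r", "iy", "m"],  # same sounds, different words
}

# Acoustic model: log P(observed audio | phone sequence). Both candidates
# share the same phones here, so the score is identical.
def acoustic_logprob(audio: bytes, phones: list) -> float:
    return -12.3  # placeholder score

# Language model: log P(word sequence), learned from lots of usage data.
language_logprob = {"ice cream": math.log(0.008), "i scream": math.log(0.0004)}

def decode(audio: bytes, candidates: list) -> str:
    """Pick the candidate with the best combined acoustic + language score."""
    return max(candidates,
               key=lambda w: acoustic_logprob(audio, lexicon[w]) + language_logprob[w])

# Both hypotheses sound alike, so the language model breaks the tie.
print(decode(b"<audio>", ["ice cream", "i scream"]))  # -> "ice cream"
```

The example also hints at why more data helps: the language-model probabilities get sharper as more real usage is observed.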

What about the role of instinct in the model?

The role is only to the degree to which those instincts get learned into the statistical model. A human being has those instincts from lots and lots of speech and conversation. The machine also develops an ‘instinct' and knows what to expect through what it has heard, and this can happen only through experience with millions of cases of people doing voice searches. As more sounds come in, it is easier for the speech recogniser to guess the best match in the model it has built. We don't explicitly programme anything.

So, what's Google trying to achieve through all this?

Google's mission statement has several pieces to it. The first is to organise the world's information, and a lot of that information is spoken information. We are also breaking the language barrier and making it readable. The other part is to make the information universally accessible, convenient and useful. For instance, it is hard to type on a tiny keypad on a mobile device, or you may be driving or walking. Today it is easy to find any address – you talk to your phone and the phone tells you what turns to take.

Our long-term vision is to make speech technology completely ubiquitously available on every application, in every usage scenario, across devices.

What has been the success rate?

The success depends on the application and it differs from language to language. But it is high enough for lots of people to come back and use it.

We have higher accuracy in something like voice search than in YouTube transcription. The subject matter of YouTube videos is wide-ranging and harder to predict than voice search queries; while you can search for just about anything, searches are still more predictable than what people might talk about in a YouTube video.

The uptake and usage of voice search have grown rapidly in the past year. It is very accurate, though not perfect, and it works well enough to be useful. We also see repeat usage, with people using it multiple times a day. There is still a long way to go to get close to our big vision, and lots and lots of research to do. We are working hard to improve efficiency. But we are pretty happy with the progress made so far.

A few years ago, most people, including those in the speech recognition field, would have been surprised if they could have looked ahead a few years and been told you could do Google search by voice. Now it is used all the time, and it works quite well.

What are the various challenges faced as you try to improve your technologies?

Infrastructure, because it is tied to latency: the better the infrastructure, the snappier the interaction and the more people will use it. Connectivity is even more important. On YouTube, there is often background noise, and getting accurate speech recognition amid background noise or a music track is a big challenge. Unusual accents don't get recognised easily, and we want to cover more accents; this is a big area of research. Variability, or the lack of predictability in what people will talk about, is another challenge.

What's brewing in your labs currently?

Speech-to-speech translation. Experimental products are already out there. You say something in one language and it gets recognised, translated and spoken by the speech synthesiser in another language. For example, if I am travelling in France and I don't speak French, I take out my phone and ask ‘Where is the nearest post office?' This is translated into French, so that the person standing next to me who speaks French can hear it and guide me. It is a hard technology, but we have put it together.
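The pipeline can be pictured as three chained stages, as in this hypothetical Python sketch; the function names and outputs are illustrative assumptions, not a real Google API.

```python
# A sketch of the speech-to-speech pipeline described above: speech
# recognition -> machine translation -> speech synthesis, with plain text as
# the intermediate representation between stages.

def recognise(audio: bytes, source_lang: str) -> str:
    """Speech recognition: audio in the source language -> text."""
    return "where is the nearest post office?"

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Machine translation: source-language text -> target-language text."""
    return "où est le bureau de poste le plus proche ?"

def synthesise(text: str, lang: str) -> bytes:
    """Speech synthesis: target-language text -> audio to play aloud."""
    return b"<french audio>"

def speech_to_speech(audio: bytes, source_lang: str, target_lang: str) -> bytes:
    # Each stage is an independently trained system; chaining them gives
    # speech-to-speech translation, at the cost of compounding their errors.
    text = recognise(audio, source_lang)
    translated = translate(text, source_lang, target_lang)
    return synthesise(translated, target_lang)

audio_out = speech_to_speech(b"<english audio>", "en", "fr")
```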

What's keeping you bullish about the future?

Smartphone usage is growing rapidly. Another big reason for optimism is that our fundamental paradigm is data-driven and the model is built using machine learning. The machine learns from data: as we get more data, we can build a richer and bigger model to take advantage of it, and we will see improvements in performance. Google has access to more data than anyone else. That is why we are making significant progress…

swethak@thehindu.co.in
