Google has significantly improved the speech recognizer on its mobile phones. The new software outputs text character by character in real time and runs entirely on the mobile device, which means that the dictation system will work offline and without network latency.
Johan Schalkwyk, a Google Fellow with the company’s Speech Team, explained the new system in a recent Google AI Blog post. According to Schalkwyk, more conventional speech recognition systems convert speech to text in three separate steps. The first analyzes an audio sample to identify specific sounds, the second uses those sounds to form words, and the third applies a language model to complete the sentence.
The drawback is that those traditional systems require a complete input sequence in order to generate a transcription. Google’s team used Recurrent Neural Network transducer (RNN-T) technology to convert audio input to text output on a character-by-character basis, improving speed by outputting each individual letter instead of a longer word or phrase.
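To make the character-by-character idea concrete, here is a minimal Python sketch of a greedy, transducer-style decoding loop. It is not Google's implementation: the `toy_joint` function, the empty-string blank symbol, the frame format, and the `max_symbols_per_frame` cap are all hypothetical stand-ins for the trained networks and real audio features. The only point is the control flow, in which characters are emitted while audio frames are still arriving rather than after the whole input has been seen.

```python
from typing import Callable, Iterable, Iterator, List

BLANK = ""  # hypothetical "emit nothing, move to the next frame" symbol


def greedy_transducer_decode(
    frames: Iterable[str],
    joint: Callable[[str, List[str]], str],
    max_symbols_per_frame: int = 4,
) -> Iterator[str]:
    """Yield characters one at a time as each audio frame is consumed.

    `joint` stands in for the trained model: it looks at the current frame
    and the characters emitted so far, and returns either one character or
    BLANK. BLANK means "nothing more to emit for this frame".
    """
    history: List[str] = []
    for frame in frames:
        # Emit zero or more characters for this frame, then move on.
        for _ in range(max_symbols_per_frame):
            symbol = joint(frame, history)
            if symbol == BLANK:
                break
            history.append(symbol)
            yield symbol  # available to the caller before later frames arrive


def toy_joint(frame: str, history: List[str]) -> str:
    """Toy model (assumption): each frame is the transcript prefix that
    should be complete once that frame has been heard."""
    emitted = len(history)
    return frame[emitted] if emitted < len(frame) else BLANK


if __name__ == "__main__":
    # Five fake "audio frames"; a real system would pass acoustic features.
    audio = ["", "h", "h", "hi", "hi"]
    for ch in greedy_transducer_decode(audio, toy_joint):
        print(ch, end="", flush=True)
    print()
```

Because the decoder is a generator, a caller can display each character as soon as it is yielded, which is the behavior the article describes. A system that needs the complete input sequence would instead have to wait for the final frame before returning anything.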
The new platform is also smaller than its predecessors, reducing the speech recognizer footprint from 2 GB to 80 MB. At the former size, a speech recognizer is too unwieldy to store on a mobile device and therefore requires a network connection in order to function. The new dictation system is small enough to embed on a standard smartphone and will be available to customers online or offline.
For now, the new speech recognizer will only be available in American English on Pixel phones, though Google hopes to launch the service for more languages and devices soon. The announcement is the latest RNN breakthrough for the company’s speech recognition team, which achieved human parity back in 2017.
Source: Google AI Blog