Traditional Automatic Speech Recognition (ASR) achieves roughly 85% accuracy, and at that rate users are often frustrated with the experience of using such a system. Traditional ASR is also fragile:

1) It requires extensive parameter tuning just to make it work.
2) It requires a deep understanding of both the language model and the acoustic model.
3) It doesn't scale well to multiple languages.
4) It is hyper-sensitive to speaker variation.

Deep learning has been applied to the acoustic model, but the accuracy gains were modest. What if we could apply deep learning end to end? Connectionist Temporal Classification (CTC, 2006) introduces the idea of applying an FFT to a recorded voice command (sampled at 8 kHz) to construct a spectrogram, with a deep neural network applied to each spectrogram interval. The basic idea is to have the RNN's output neurons encode a distribution over "symbols", as sketched below. Traditional ASR uses a phone...
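A minimal sketch of that pipeline, assuming an 8 kHz mono recording and a character-level symbol set: SciPy builds the spectrogram via FFT, a bidirectional RNN emits a per-frame distribution over symbols, and CTC loss aligns those frames with a transcript without any pre-segmented phone alignment. The names (SAMPLE_RATE, SYMBOLS, SpeechRNN) and the choice of SciPy/PyTorch are illustrative assumptions, not the talk's actual implementation.

    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.signal import spectrogram

    SAMPLE_RATE = 8000                  # 8 kHz, as mentioned above
    SYMBOLS = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz' ")  # CTC blank + characters

    # 1) FFT-based front end: waveform -> spectrogram (time x frequency).
    waveform = np.random.randn(SAMPLE_RATE * 2)            # stand-in for 2 s of audio
    freqs, times, spec = spectrogram(waveform, fs=SAMPLE_RATE, nperseg=256)
    features = torch.tensor(np.log(spec.T + 1e-10), dtype=torch.float32)  # (T, F)

    # 2) RNN whose output neurons encode a distribution over symbols per frame.
    class SpeechRNN(nn.Module):
        def __init__(self, n_freq, n_symbols, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_symbols)

        def forward(self, x):                             # x: (batch, T, F)
            h, _ = self.rnn(x)
            return self.out(h).log_softmax(dim=-1)        # (batch, T, n_symbols)

    model = SpeechRNN(n_freq=features.shape[1], n_symbols=len(SYMBOLS))
    log_probs = model(features.unsqueeze(0))              # (1, T, n_symbols)

    # 3) CTC aligns per-frame symbol distributions with the transcript.
    target = torch.tensor([[SYMBOLS.index(c) for c in "hello world"]])
    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs.transpose(0, 1),                 # CTC expects (T, batch, C)
               target,
               input_lengths=torch.tensor([log_probs.shape[1]]),
               target_lengths=torch.tensor([target.shape[1]]))
    print(loss.item())

The key design point is that the network never sees phone boundaries: CTC sums over all frame-level alignments consistent with the target string, which is what makes end-to-end training possible.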