Traditional Automatic Speech Recognition (ASR) achieves about 85% accuracy. At that rate, users often find such a system frustrating to use.
Traditional ASR is often fragile:
1) It requires extensive parameter tuning just to make it work.
2) It requires extensive understanding of a language model and an acoustic model.
3) It doesn't scale well to multiple languages.
4) It is hypersensitive to speaker variation.
Deep learning (DL) has been introduced for the acoustic model, but it has not produced much gain in accuracy.
What if we could do DL from end to end?
The end-to-end approach builds on Connectionist Temporal Classification (CTC, 2006). The input is prepared by applying an FFT to short intervals of a voice-command recording, which constructs a spectrogram at 8 kHz. Each spectrogram interval then serves as the neural network's input for one time step.
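The spectrogram step can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline used in the talk: the frame length, hop size, Hann window, and the synthetic 1 kHz test tone are all assumptions, and the 8 kHz figure is read here as the sampling rate.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrogram(samples, frame_len=256, hop=128):
    """Return one magnitude spectrum (frame_len // 2 + 1 bins) per frame."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_len)]
        spectrum = fft(chunk)
        # Keep only the non-negative frequencies of the real-valued signal.
        frames.append([abs(c) for c in spectrum[:frame_len // 2 + 1]])
    return frames


SAMPLE_RATE = 8000  # Hz; assumption: "8kHz" in the notes is the sampling rate
# Hypothetical input: a quarter second of a pure 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(2000)]
spec = spectrogram(tone)
# The strongest bin of the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 256
```

Each row of `spec` is one time step that the network would consume. In practice a library FFT would replace the hand-written one.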
The basic idea is to have the RNN's output neurons encode a distribution over "symbols".
Traditional ASR uses a phoneme-based or a grapheme-based model, which is again susceptible to speaker variation. For example, if one speaker takes 10 seconds to say 'hello' and another takes 5 seconds, how do we map each phoneme to a neuron?
CTC handles this temporal mapping by putting a softmax on top of a dense layer, so that the DNN/RNN outputs a probability for every symbol at every time step and the most probable transcription can be picked.
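A toy sketch of how this removes the dependence on speaking rate: take a softmax over the symbols at every frame, pick the best symbol, then merge repeated symbols and drop CTC's blank symbol. This is the simple "best path" decoding only; the full CTC objective sums over all alignments and is not shown. The alphabet, the `scores_for` helper, and the fake network scores are made up for illustration.

```python
import math

BLANK = "_"  # CTC's extra "no symbol" output
SYMBOLS = [BLANK, "h", "e", "l", "o"]  # toy alphabet (hypothetical)


def softmax(scores):
    """Turn raw per-symbol scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_ctc_decode(frame_scores):
    """Pick the most probable symbol per frame, merge repeats, drop blanks."""
    best_path = []
    for scores in frame_scores:
        probs = softmax(scores)
        best_path.append(SYMBOLS[probs.index(max(probs))])
    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous frame.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


def scores_for(path):
    """Build fake network outputs that favour the given per-frame symbols."""
    return [[5.0 if s == wanted else 0.0 for s in SYMBOLS] for wanted in path]


# The same word spoken quickly (6 frames) and slowly (19 frames).
fast = greedy_ctc_decode(scores_for("hel_lo"))
slow = greedy_ctc_decode(scores_for("hhh_eee_lll__ll_ooo"))
```

Both `fast` and `slow` decode to the same string. The blank between the two runs of "l" is what lets a genuine double letter survive the merge.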
Using a DNN/RNN and training the model for many days on a high-end compute machine, we were able to get the model to transcribe a voice recording in English. Our accuracy is around 92%, and the model is not sensitive to speaker variation.
Training a deep speech recognition (NLP) model is tricky. The idea is to use SortaGrad, a curriculum learning strategy (Bengio et al., ICML 2009) that presents shorter utterances first.
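A minimal sketch of that ordering, as SortaGrad is usually described: the first epoch visits minibatches from shortest to longest utterances, and later epochs visit them in random order. The function name, the `(length, transcript)` tuple layout, and the sample data are assumptions made for this example.

```python
import random


def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Return minibatches: length-sorted in epoch 0, shuffled afterwards.

    `utterances` is a list of (audio_length_seconds, transcript) pairs;
    this layout is hypothetical and chosen only for the sketch.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if epoch > 0:
        # After the first epoch, fall back to a random order of minibatches.
        random.Random(seed + epoch).shuffle(batches)
    return batches


# Hypothetical training set.
data = [(3.2, "turn on the lights"), (0.8, "stop"), (1.5, "play music"),
        (5.1, "what is the weather tomorrow"), (2.0, "call home"),
        (0.5, "yes")]

first = sortagrad_batches(data, batch_size=2, epoch=0)
first_lengths = [[u[0] for u in batch] for batch in first]
later = sortagrad_batches(data, batch_size=2, epoch=1)
```

Here `first_lengths` runs from the shortest pair of utterances to the longest, while `later` holds the same batches in shuffled order. Starting with short utterances tends to keep early training more stable.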
