Traditional Automatic Speech Recognition (ASR) systems achieve roughly 85% accuracy. At that rate, users are often frustrated with the experience of using such a system.
The traditional ASR pipeline is often fragile:
1) It requires extensive tuning of parameters just to make it work.
2) It requires deep understanding of both a language model and an acoustic model.
3) It doesn't scale well to multiple languages.
4) It is hyper-sensitive to speaker variation.
Deep learning has been applied to the acoustic model, but without much gain in accuracy.
What if we could apply deep learning end to end?
Connectionist Temporal Classification (CTC, 2006) introduces the idea of taking an FFT of the recorded voice command to construct a spectrogram (e.g. from 8 kHz audio), and assigning a deep neural network output to each spectrogram interval individually.
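As a minimal sketch of this front end (assuming 8 kHz mono audio in a NumPy array and the standard scipy.signal.spectrogram API; the window and hop sizes are illustrative, not the values used in any particular paper):

import numpy as np
from scipy.signal import spectrogram

# Assume `audio` is a 1-D NumPy array of samples recorded at 8 kHz.
sample_rate = 8000
audio = np.random.randn(sample_rate * 2)  # placeholder: 2 seconds of noise

# Short-time FFT: 25 ms windows with a 10 ms hop (typical, but illustrative).
freqs, times, spec = spectrogram(
    audio,
    fs=sample_rate,
    nperseg=int(0.025 * sample_rate),   # 200-sample window
    noverlap=int(0.015 * sample_rate),  # 15 ms overlap -> 10 ms hop
)

# Log-compress the power spectrum; each column is one input frame for the network.
log_spec = np.log(spec + 1e-10)
print(log_spec.shape)  # (n_freq_bins, n_frames)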
The basic idea is to have the RNN's output neurons encode a distribution over "symbols".
Traditional ASR uses a phoneme-based or grapheme-based model, which again is susceptible to speaker variation: if a speaker stretches "hello" over 10 seconds instead of 5, how do we map each phoneme to a network output?
CTC solves this temporal mapping. A softmax on top of a dense layer gives, at every time step of the DNN/RNN output, a probability distribution over the symbols (plus a "blank"), and training finds the most probable labelling by summing over all possible alignments.
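A minimal sketch of this output layer and the CTC objective, assuming PyTorch and a character vocabulary with index 0 reserved for the blank symbol (the network and batch sizes below are illustrative placeholders):

import torch
import torch.nn as nn

# Illustrative sizes: 161 spectrogram bins, 27 symbols (26 letters + space) plus the CTC blank.
n_features, n_hidden, n_symbols = 161, 256, 28  # index 0 = blank

rnn = nn.GRU(input_size=n_features, hidden_size=n_hidden, batch_first=False)
dense = nn.Linear(n_hidden, n_symbols)
ctc_loss = nn.CTCLoss(blank=0)

# One batch: T time steps, N utterances, each frame is a spectrogram column.
T, N = 100, 4
frames = torch.randn(T, N, n_features)

hidden, _ = rnn(frames)
log_probs = dense(hidden).log_softmax(dim=-1)  # distribution over symbols per time step

# Target transcripts as symbol indices (random placeholders of length 20 each).
targets = torch.randint(1, n_symbols, (N, 20))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

# CTC sums over all alignments of the targets to the T output steps.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()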
Training this DNN/RNN model over many days on high-end compute hardware, we were able to transcribe English voice recordings with accuracy around 92%, and the model is not sensitive to speaker variation.
Training a Deep Speech model is tricky. One key idea is SortaGrad, a curriculum-learning strategy (curriculum learning: Bengio et al., ICML 2009) that presents short, easier utterances first during the initial epoch, as sketched below.
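A minimal sketch of that curriculum, assuming each training example carries its utterance duration (the dataset structure and field names here are hypothetical):

import random

def sortagrad_batches(examples, batch_size, epoch):
    """Yield minibatches; in the first epoch, sort by utterance length (SortaGrad-style)."""
    if epoch == 0:
        # First epoch: shortest utterances first, so early gradients come from easy examples.
        ordered = sorted(examples, key=lambda ex: ex["duration_sec"])
    else:
        # Later epochs: the usual random shuffle.
        ordered = list(examples)
        random.shuffle(ordered)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Hypothetical usage with a toy dataset of (spectrogram, transcript, duration) records.
dataset = [{"spectrogram": None, "transcript": "hello", "duration_sec": d}
           for d in (3.2, 0.8, 1.5, 2.1)]
for batch in sortagrad_batches(dataset, batch_size=2, epoch=0):
    print([ex["duration_sec"] for ex in batch])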
References
• Gales and Young. "The Application of Hidden Markov Models in Speech Recognition." Foundations and Trends in Signal Processing, 2008.
• Jurafsky and Martin. "Speech and Language Processing." Prentice Hall, 2000.
• Bourlard and Morgan. "Connectionist Speech Recognition: A Hybrid Approach." Kluwer Publishing, 1994.
• A. Graves, S. Fernández, F. Gomez, J. Schmidhuber. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." ICML, 2006.
• Hannun, Maas, Jurafsky, Ng. "First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs." arXiv:1408.2873.
• Hannun, et al. "Deep Speech: Scaling up end-to-end speech recognition." arXiv:1412.5567.
• H. Hermansky. "Perceptual linear predictive (PLP) analysis of speech." J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
• H. Hermansky and N. Morgan. "RASTA processing of speech." IEEE Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, Oct. 1994.
• H. Schwenk. "Continuous space language models." 2007.