How to improve the traditional ASR using Connectionist Temporal Classification

Traditional Automatic Speech Recognition (ASR) performs at about 85% accuracy. At this rate, users are often frustrated by the experience of using such a system.

Traditional ASR is often fragile:

1) It requires extensive parameter tuning just to make it work.
2) It requires extensive understanding of both a language model and an acoustic model.
3) It doesn't scale well to multiple languages.
4) It is hypersensitive to speaker variation.

Deep learning has been introduced into the acoustic model, but it has not brought much gain in accuracy.

What if we could apply deep learning (DL) end to end?

Connectionist Temporal Classification (CTC, 2006) works on a spectrogram of the recording: an FFT is applied to short windows of the voice command (8 kHz audio) to get the frequency content at each interval. Each spectrogram interval then serves as one time step of input to the DL neural network.
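To make the spectrogram step concrete, here is a minimal sketch in plain Python. It assumes that "8kHz" refers to the sampling rate, and the frame length (256 samples), hop (128 samples), Hann window and 1 kHz test tone are illustrative choices, not values from the original system.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def spectrogram(samples, frame_len=256, hop=128):
    """Return one magnitude spectrum (frame_len // 2 + 1 bins) per frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_len)]
        spectrum = fft(chunk)
        # Keep only the non-negative frequencies of the real-valued signal.
        frames.append([abs(c) for c in spectrum[: frame_len // 2 + 1]])
    return frames

SAMPLE_RATE = 8000  # assumed sampling rate in Hz
# Hypothetical test signal: 0.25 s of a 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 4)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames,", len(spec[0]), "bins, peak at",
      peak_bin * SAMPLE_RATE / 256, "Hz")
```

Each row of `spec` is what the network would see as one time step. A production pipeline would normally use an optimized FFT library instead of this recursive version.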

The basic idea is to have the RNN's output neurons encode a distribution over "symbols".

Traditional ASR uses a phoneme-based or a grapheme-based model, which is again susceptible to speaker variation. For example, if one speaker says 'hello' over 10 seconds and another over 5 seconds, how do we map each phoneme to a neuron?

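CTC handles the temporal mapping by applying a softmax on top of a dense layer at every time step of the DNN/RNN, which gives a probability distribution over the symbols plus a special "blank" for each frame. The most probable output is then read off by merging repeated symbols and dropping the blanks, so the same word yields the same transcript whether it is spoken quickly or slowly.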
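The decoding step can be sketched as follows. This is greedy (best-path) CTC decoding under simplifying assumptions: the five-symbol alphabet, the `_` blank marker and the `fake_logits` helper are hypothetical stand-ins for a trained network's output, not part of the original system.

```python
import math

BLANK = "_"
ALPHABET = [BLANK, "h", "e", "l", "o"]  # illustrative symbol set

def softmax(logits):
    """Turn one frame of dense-layer outputs into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ctc_greedy_decode(logit_frames):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best_path = []
    for logits in logit_frames:
        probs = softmax(logits)
        best_path.append(max(range(len(probs)), key=probs.__getitem__))
    decoded = []
    previous = None
    for idx in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank.
        if idx != previous and ALPHABET[idx] != BLANK:
            decoded.append(ALPHABET[idx])
        previous = idx
    return "".join(decoded)

def fake_logits(path):
    """Hypothetical network output: a high score for the given symbol per frame."""
    return [[5.0 if symbol == char else 0.0 for symbol in ALPHABET]
            for char in path]

slow = fake_logits("hhhh__ee__ll__ll__oooo")  # slow speaker, 22 frames
fast = fake_logits("h_el_lo")                 # fast speaker, 7 frames
print(ctc_greedy_decode(slow), ctc_greedy_decode(fast))
```

Both inputs decode to the same word even though they span different numbers of frames. The blank between the two runs of `l` is what keeps the double letter from being merged into one.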

Using a DNN/RNN and training the model over many days on a high-end compute machine, we were able to get the model to transcribe voice recordings in English. Our accuracy is around 92%, and the model is not sensitive to speaker variation.
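What training optimizes in a CTC setup is the negative log-likelihood of the correct transcript, summed over every frame-level alignment that collapses to it. Below is a minimal sketch of that forward computation in the style of Graves et al. (2006). It is illustrative only: it works with raw probabilities rather than log space, so it would underflow on long recordings, and the function name and toy input are assumptions.

```python
import math

def ctc_neg_log_likelihood(probs, labels, blank=0):
    """probs[t][k] is the softmax output for symbol k at frame t;
    labels is the target transcript as a list of symbol indices."""
    # Extended target with blanks around every label: [b, l1, b, l2, ..., b].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] = total probability of all partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

# Toy check: two frames, symbols {blank, "a"}, uniform outputs.
# The alignments "aa", "a_" and "_a" all collapse to "a": 3 * 0.25 = 0.75.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, labels=[1])
print(math.exp(-loss))
```

Deep learning frameworks ship numerically stable versions of this loss together with its gradient, which is what one would use for real training.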

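Training a deep speech recognition model is tricky. The idea is to use SortaGrad, a form of curriculum learning (Bengio et al., ICML 2009): in the first epoch the training utterances are presented in order of increasing length, and later epochs go back to random order.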
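A rough sketch of that batching schedule, assuming each training example is a `(duration_seconds, transcript)` pair. The function name, the batch size and the tiny corpus are hypothetical and only illustrate the ordering.

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Return the minibatches for one epoch.

    Epoch 0: batches run from shortest to longest utterances.
    Later epochs: the same length-grouped batches in random order.
    """
    ordered = sorted(utterances, key=lambda utt: utt[0])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if epoch > 0:
        # Seeded per epoch so a run can be reproduced.
        random.Random(seed + epoch).shuffle(batches)
    return batches

# Hypothetical corpus of (duration in seconds, transcript).
corpus = [
    (4.2, "how are you today"),
    (0.8, "hi"),
    (9.5, "please set an alarm for seven in the morning"),
    (1.5, "hello"),
    (6.1, "what is the weather like"),
    (2.3, "turn it off"),
]

first_epoch = sortagrad_batches(corpus, batch_size=2, epoch=0)
later_epoch = sortagrad_batches(corpus, batch_size=2, epoch=1)
print([[utt[0] for utt in batch] for batch in first_epoch])
```

Starting with short utterances keeps the early gradients better behaved, because long sequences tend to produce larger and less stable CTC costs while the weights are still near their initial values.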

References

• Gales and Young. “The Application of Hidden Markov Models in Speech Recognition”. Foundations and Trends in Signal Processing, 2008.

• Jurafsky and Martin. “Speech and Language Processing”. Prentice Hall, 2000.

• Bourlard and Morgan. “Connectionist Speech Recognition: A Hybrid Approach”. Kluwer Publishing, 1994.

• A. Graves, S. Fernández, F. Gomez, J. Schmidhuber. “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”. ICML, 2006.

• Hannun, Maas, Jurafsky, Ng. “First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs”. arXiv:1408.2873

• Hannun, et al. “Deep Speech: Scaling up end-to-end speech recognition”. arXiv:1412.5567

• H. Hermansky. “Perceptual linear predictive (PLP) analysis of speech”. J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.

• H. Hermansky and N. Morgan. “RASTA processing of speech”. IEEE Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, Oct. 1994.

• H. Schwenk. “Continuous space language models”, 2007.
