Skip to main content

How to improve the traditional ASR using Connectionist Temporal Classification

The traditional Automatic Speech Recognition (ASR) performs at about 85% accuracy rate.  At this rate, ASR users are often frustrated with the experience with using such a system.

The tradition ASR is often fragile:

1) requires extensive modification of parameters, just to make it work.
2) requires extensive understanding of a language model and a acoustic model.
3) doesn't scale well to multiple languages.
4) hyper-sensitive to speaker variants.


Deep Learning on the acoustic model has been introduced, but not much of gain in the accuracy.

What if, we can do a DL from end to end?








Connectionist Temporal Classification (2006) introduces an idea of using FFT on the frequency of a recording of a voice command and constructs a spectrogram at 8kHz.  At each spectrogram interval, a DL neural network can be assigned, individually.



The basic idea is to have RNN output neurons to encode distribution over "symbols".

The traditional ASR uses a phoneme-based model or a graphme-based model.  Again, suspectible to speaker variants.  e.g. If a speaker speaks slowly 'hello' over 10 seconds or 5 seconds, how do we map each phoneme to a neuron?

CTC allows the temporaral mapping to each DNN/RNN neuron by using softmax on top of a dense layer to provide the best possibility model. 





Using DNN/RNN, train the model over many many days on a high-end compute machine, we were able to have the model to transcribe a voice recording in English.  Our accuracy is around 92%, and it is not sensitive to speaker variants.

Trainging Deep Speech Recognition (NLP) is tricky.  The idea is to use SortaGrad (Bengio et al, ICML 2009)





References

• Gales and Young. “The Applica,on of Hidden Markov Models in Speech Recogni,on” Founda,ons and Trends in Signal Processing, 2008. 
• Jurafsky and Mar,n. “Speech and Language Processing”. Pren,ce Hall, 2000. 
• Bourlard and Morgan. “CONNECTIONIST SPEECH RECOGNITION: A Hybrid Approach”. Kluwer Publishing, 1994. 
• A Graves, S Fernández, F Gomez, J Schmidhuber. “Connec,onist temporal classifica,on: labelling unsegmented sequence data with recurrent neural networks.” ICML, 2006. 
• Hannun, Maas, Jurafsky, Ng. “First-­‐Pass Large Vocabulary Con,nuous Speech Recogni,on using Bi-­‐Direc,onal Recurrent DNNs” ArXiv: 1408.2873 
• Hannun, et al. “Deep Speech: Scaling up end-­‐to-­‐end speech recogni,on”. ArXiv:1412.5567 
• H. Hermansky, "Perceptual linear predic,ve (PLP) analysis of speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-­‐1752, Apr. 1990. 
• H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-­‐589, Oct. 1994. 
• H. Schwenk, “Con,nuous space language models”, 2007. 







Comments

Popular posts from this blog

How to train a neural network to retrieve 3D maps from videos

This blog is about how to train a neural network to extract depth maps from videos of moving people captured with a monocular camera. Note: With a monocular camera, extracting the depth map of moving people is difficult.  Difficulty is due to the motion blur and the rolling shutter of an image.  However, we can overcome these limitations by predicting the depth maps by the model trained with a generated dataset using SfM and MVS from the normalized videos. This normalized dataset can be the basis of the training set for the neural network to automatically extract the accurate depth maps from a typical video footage, without any further assistance from a MVS. To start this project with a SfM and a MVS, we will use TUM Dataset. So, the basic idea is to use SfM and Multiview Stereo to estimate depth, while serves as supervision during training. The RGB-D SLAM reference implementation from these papers are used: - RGB-D Slam (Robotics OS) - Real-time 3D Visual SLAM ...

How to use Convolution Neural Network to predict SIFT features

A feature locator is essential in all CV domain.  It's the basis of the germetric transformation, epipolar geometry, to 3D mesh reconstruction. Many techniques - SIFT and other SLAM technologies, are available, but they require ideal environments to work in. To address the short comings: - sensitive to low texture environment - sensitive to low light envonrment - sensitive to high light environment (like outdoor day light with above 20k lux) - and many other issues I propose a CNN based neural network to detect 4 correspondences in an image A and an image B. Since it is tricky to have a neural network to predict a 4x4 affine matrix of rotation and translation, I separated the translation vector from the rotation vector. Basically, the ground truth data will be precalcalated with a generic SIFT with RANSAC to calculate the correspondences set P and P'. The L2 (Eucledean) distance will be used between a predicted value.  They are 4 points, so an averaged will ...