
How to improve the traditional ASR using Connectionist Temporal Classification

Traditional Automatic Speech Recognition (ASR) performs at an accuracy of about 85%. At this rate, users are often frustrated by the experience of using such a system.

Traditional ASR is often fragile. It:

1) requires extensive parameter tuning just to make it work.
2) requires extensive understanding of both a language model and an acoustic model.
3) doesn't scale well to multiple languages.
4) is hypersensitive to speaker variation.


Deep learning (DL) has been introduced for the acoustic model, but it has not yielded much gain in accuracy.

What if we could apply DL from end to end?








The approach built around Connectionist Temporal Classification (CTC, 2006) starts by applying an FFT to short windows of a recorded voice command to construct a spectrogram at 8 kHz. Each spectrogram frame is then fed to the DL neural network as its own time step.
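To make the spectrogram step concrete, here is a minimal sketch in plain Python. The frame length, hop size, Hann window, and the reading of "8 kHz" as the sample rate are my assumptions for illustration, not details from the original system. A naive DFT stands in for the FFT (same values, just slower).

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed reading of "8 kHz"
FRAME_LEN = 64      # samples per frame (assumed)
HOP = 32            # samples between frame starts (assumed)

def spectrogram(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return one magnitude spectrum per overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        # Hann window to reduce spectral leakage at the frame edges.
        chunk = [x * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_len - 1)))
                 for n, x in enumerate(chunk)]
        # Naive DFT over the non-negative frequency bins.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(chunk))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# A synthetic 1 kHz tone stands in for a voice recording.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

Each entry of `spec` is one time step that the network would see. In practice a library FFT would replace the inner loop.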



The basic idea is to have the RNN's output neurons encode a distribution over "symbols".
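As a rough illustration of what "a distribution over symbols" means for a single time step, the sketch below turns raw output scores into probabilities with a softmax. The tiny symbol set and the scores are made up for the example. The extra "_" entry is the blank symbol that CTC adds to the alphabet.

```python
import math

SYMBOLS = ["_", "h", "e", "l", "o"]  # "_" is the CTC blank; toy alphabet

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw outputs of the network for one spectrogram frame.
frame_scores = [0.1, 2.0, 0.3, -1.0, 0.0]
probs = softmax(frame_scores)
best = SYMBOLS[probs.index(max(probs))]
```

Here `probs` holds one probability per symbol for this frame, and `best` is the symbol the network considers most likely at that moment.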

Traditional ASR uses a phoneme-based or grapheme-based model, which is again susceptible to speaker variation. For example, if one speaker says 'hello' slowly over 10 seconds and another says it in 5 seconds, how do we map each phoneme to a neuron?

CTC handles this temporal mapping onto the DNN/RNN outputs by placing a softmax on top of a dense layer, which gives the most probable symbol at each time step.
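Here is a small sketch of how CTC's collapsing rule copes with the slow versus fast 'hello' problem: take the most likely symbol per frame, merge consecutive repeats, then drop the blank symbol. The frame sequences below are invented for illustration. This is simple best-path decoding, not the full CTC loss or a beam search.

```python
from itertools import groupby

BLANK = "_"  # CTC blank symbol

def ctc_collapse(frame_symbols):
    """Merge consecutive repeats, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(frame_symbols)]
    return "".join(s for s in merged if s != BLANK)

# Invented per-frame outputs: a slow speaker yields many more frames.
slow = "hhhh__eee__llll__lll__oooo"
fast = "h_e_l_lo"

slow_text = ctc_collapse(slow)
fast_text = ctc_collapse(fast)
```

Both sequences collapse to the same word even though they differ in length. Note that the blank between the two runs of "l" is what keeps the double letter from being merged into one.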





Using a DNN/RNN and training the model over many days on a high-end compute machine, we were able to get the model to transcribe voice recordings in English. Our accuracy is around 92%, and the model is not sensitive to speaker variation.

Training deep speech recognition models is tricky. The idea is to use SortaGrad, a curriculum learning strategy (cf. Bengio et al., ICML 2009).
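As I understand SortaGrad from the Deep Speech 2 work, the first epoch presents utterances from shortest to longest, and later epochs go back to a random minibatch order. The sketch below shows that ordering only. The `sortagrad_batches` helper and the `num_frames` field are hypothetical names for this example, not part of any library.

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Return the minibatches for one epoch (hypothetical helper).

    Batches are built from length-sorted utterances so each batch holds
    clips of similar length. Epoch 0 keeps the batches in
    shortest-to-longest order; later epochs shuffle the batch order.
    """
    ordered = sorted(utterances, key=lambda u: u["num_frames"])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if epoch > 0:
        random.Random(seed + epoch).shuffle(batches)
    return batches

# Made-up clip lengths, measured in spectrogram frames.
clips = [{"id": i, "num_frames": n}
         for i, n in enumerate([900, 120, 450, 300, 80, 610])]

first_epoch = sortagrad_batches(clips, batch_size=2, epoch=0)
later_epoch = sortagrad_batches(clips, batch_size=2, epoch=1)
```

Starting with short utterances gives the model easier examples while its weights are still close to random, which is the curriculum idea.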





References

• Gales and Young. “The Application of Hidden Markov Models in Speech Recognition”. Foundations and Trends in Signal Processing, 2008.
• Jurafsky and Martin. “Speech and Language Processing”. Prentice Hall, 2000.
• Bourlard and Morgan. “Connectionist Speech Recognition: A Hybrid Approach”. Kluwer Publishing, 1994.
• A. Graves, S. Fernández, F. Gomez, J. Schmidhuber. “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”. ICML, 2006.
• Hannun, Maas, Jurafsky, Ng. “First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs”. arXiv:1408.2873.
• Hannun, et al. “Deep Speech: Scaling up end-to-end speech recognition”. arXiv:1412.5567.
• H. Hermansky. “Perceptual linear predictive (PLP) analysis of speech”. J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
• H. Hermansky and N. Morgan. “RASTA processing of speech”. IEEE Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, Oct. 1994.
• H. Schwenk. “Continuous space language models”. 2007.






