Finding a better local minimum in Deep Learning

Training a DL model means searching for a good local minimum of the loss in an n-dimensional parameter space, and that search can be a challenge.  Typically, data scientists and ML engineers use gradient descent to drive it.

The starting learning rate is usually somewhere between 1e-4 and 1e-3, and holding it constant is a slow way to approach a local minimum.
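To make that concrete, here is a minimal sketch of gradient descent with a fixed learning rate; the toy quadratic loss and the helper name sgd_step are illustrative assumptions, not from any particular framework.

```python
import numpy as np

def sgd_step(weights, grad, lr=1e-3):
    """One gradient-descent update with a fixed learning rate."""
    return weights - lr * grad

# Toy quadratic loss f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for step in range(1000):
    w = sgd_step(w, grad=w, lr=1e-3)  # the learning rate never changes
```

With the rate pinned at 1e-3, each step shrinks w by only 0.1%, which illustrates how slowly a constant rate creeps toward the minimum.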

There are a few issues with this approach:

1) The first local minimum found may not be the best one.  The optimizer can get stuck in a sharp valley, where even a small change in the weights pushes the error rate back up dramatically.

2) The first critical point found may not be a minimum at all; it can be a saddle point, which is a local maximum along some directions, as shown in the graph below.

[Figure: loss surface with a saddle point]

When optimizing in the n-dimensional parameter space of a DL model, the better target is a flat valley: a region where SGD can settle on stable ground and where the error rate stays low even when the weights are perturbed slightly.

However, there is a better way to get there.

Instead of manually picking an initial learning rate and adjusting it every epoch or mini-batch, why not let the learning rate vary cyclically?
[Figure: cyclical learning rate following a half-cosine curve]

Here, the learning rate follows half a cosine wave over the initial mini-batches, and the schedule moves on to a new cycle only when the validation error stops changing much.

The benefit of this cyclic learning-rate path is that each restart kicks a stuck optimizer out of a sharp valley, while the error rate settles down again as the learning rate decays within the cycle.
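As a minimal sketch of such a schedule (the function name and the lr_max/lr_min bounds are my own assumptions, not a specific library's API):

```python
import math

def half_cosine_lr(step, cycle_len=1000, lr_max=1e-3, lr_min=1e-5):
    """Learning rate tracing half a cosine wave within each cycle:
    it starts at lr_max, decays to lr_min, then restarts at lr_max."""
    t = (step % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At the start of a cycle this returns lr_max, by the end it has decayed to roughly lr_min, and the next cycle jumps back to lr_max, giving the kick described above.  Gating the restart on a validation-error plateau, as the post suggests, would replace the fixed cycle_len with a plateau check.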


The better approach, then, is to accelerate the learning rate along a short cosine segment and decelerate it along a longer cosine segment, as sketched and shown below.  This gives SGD a better chance of landing in a flat valley of the n-dimensional space, since the number of potential local minima to explore grows exponentially with the number of dimensions.
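Here is a sketch of one way to realize that asymmetric cycle; the rise_frac parameter, which sets how much of the cycle is spent accelerating, is my own assumption, since the post leaves the exact proportions to the figure.

```python
import math

def asymmetric_cosine_lr(step, cycle_len=1000, rise_frac=0.1,
                         lr_max=1e-3, lr_min=1e-5):
    """Rise lr_min -> lr_max along a short half-cosine (the first rise_frac
    of the cycle), then decay lr_max -> lr_min along the remaining, longer one."""
    t = (step % cycle_len) / cycle_len  # position within the current cycle
    if t < rise_frac:
        u = t / rise_frac  # short accelerating segment
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 - math.cos(math.pi * u))
    u = (t - rise_frac) / (1.0 - rise_frac)  # long decelerating segment
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * u))
```

The closely related SGDR schedule [3] instead restarts the rate sharply and stretches each successive cycle; PyTorch ships that variant as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.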

[Figure: learning-rate schedule with a short cosine rise and a longer cosine decay]

References


1. Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186, 2015.

2. Yann N. Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. RMSProp and Equilibrated Adaptive Learning Rates for Non-Convex Optimization. arXiv preprint arXiv:1502.04390, 2015.

3. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016.

4. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.

5. Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot Ensembles: Train 1, Get M for Free. arXiv preprint arXiv:1704.00109, 2017.


