Skip to main content

Finding a better local minima in Deep Learning

Training a DL model to find a local minima in n-dimensions can be a challenge.  Often, data scientists and ML engineers would use a gradient descent to optimize the path.

Starting delta may be anywhere between 1e-3 or 1e-4.  Having a constant gridient would not fast-approach a local minima.

There are few issues with this approach.

1) The first found local minima may not be the best minima.  It can be stuck in a sharp valley, where any deriviate change would raise the error rate above 50% or more.

2) The first found local minima may be a local mixima, as shown in the saddle point graph below.






When optimizing on n-th dimensions of space of a DL model, the best approach is to find a flat valley, when the SGD can locate a stable ground and where error rates stay low or relatively small to what it landed in the best optimization.

However, there are a better way than this.

Instead of manually entering an initial gradient decent value and updating it every epoch or mini-batch, why don't we use a cyclically variant gradient decent?



Here, the GD value actually follows a value of half cosine for the initial mini-batch.  The GD value changes, only when the validation set error doesn't change much.

The benefit of using this cyclic path of the learning rate is to kick the stuck optimized GD out of a sharp valley, so the error rate stays stable, as the learning rate stabilizes.


The better approach, is then, to accelerate the learning rate in a shorter cosine path, and de-accelerate the learning rate in a longer cosine path, as shown below.   This ensures the SGD can land on the flat valley in n-dimension space, since the number of dimension exponentially increase the number of potential local minima to explore.








References


1. Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186

2. Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. Rmsprop and equilibrated adaptive learning rates for non-convex optimization.
arXiv preprint arXiv:1502.04390, 2015.

3 I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with restarts.
arXiv preprint arXiv:1608.03983, 2016.

4.  Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. 
arXiv preprint arXiv:1609.04836, 2016

5. Gao Huang and Yixuan Li. Snapshots Ensemble: Train 1, get M for free. arXiv preprint arXiv:1704.00109



Comments

Popular posts from this blog

How to project a camera plane A to a camera plane B

How to Create a holographic display and camcorder In the last part of the series "How to Create a Holographic Display and Camcorder", I talked about what the interest points, descriptors, and features to find the same object in two photos. In this part of the series, I'll talk about how to extract the depth of the object in two photos by calculating the disparity between the photos. In order to that, we need to construct a triangle mesh between correspondences. To construct a mesh, we will use Delaunnay triagulation.  Delaunnay Triagulation - It minimizes angles of all triangles, while the sigma of triangles is maximized. The reason for the triangulation is to do a piece wise affine transformation for each triangle mapped from a projective plane A to a projective plane B. A projective plane A is of a camera projective view at time t, while a projective plane B is of a camera projective view at time t+1. (or, at t-1.  It really doesn't matter)...

State of the Art SLAM techniques

Best Stereo SLAMs in 2017 are reviewed. Namely, (in arbitrary order) EKF-SLAM based,  Keyframe based,  Joint BA optimization based,  RSLAM,  S-PTAM,  LSD-SLAM,   Best RGB-D SLAMs in 2017 are also reviewed. KinectFusion,  Kintinuouns,  DVO-SLAM,  ElasticFusion,  RGB-D SLAM,   See my keypoints of the best Stereo SLAMs. Stereo SLAM Conditionally Independent Divide and Conquer EKF-SLAM [5]   operate in large environments than other approaches at that time uses both  close and far points far points whose depth cannot be reliably estimated due to little disparity in the stereo camera  uses an inverse depth parametrization [6] shows empirically points can be triangulated reliably, if their depth is less than about 40 times the stereo baseline.     - Keyframe-based  Stereo SLAM   - uses BA optimization in a local area to archive scalability.  ...

How to create a holographic camcorder

Since the invention of a camcorder, we haven't seen much of advancement of a video camcorder. Sure, there are few interesting, new features like capturing video in 360 or taking high resolution 4K content. But the content is still in 2D and we still watch it on a 2D display. Have you seen the movie Minority Report (2002)? There is a scene where Tom Cruise is watching a video recording of his lost son in 3D or holographically. Here is a video clip of this scene. I have been waiting for the technological advancement to do this, but it's not here yet. So I decided to build one myself. In order to build a holographic video camcorder, we need two devices. 1) a video recorder - a recorder which captures the video content in 3D or holographically. 2) a video display - a display device which shows the recorded holographic content in 3D or holographically. Do we have a technology to record a video, holographically. Yes, we can now do it, and I'll e...