Public Datasets for SLAM

TUM RGB-D benchmark [38]
- an excellent dataset for evaluating the accuracy of camera localization
- several sequences with accurate ground truth obtained with an external motion capture system

KITTI
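- 2000 corners extracted per image at the 1241x376 KITTI resolution
- other image resolutions noted: 512x384 and 752x480
- 5 corners per grid cell, to keep the corners spread across the image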
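
The "5 corners per cell" note can be illustrated with a small bucketing step. This is only a sketch under assumptions: the function name `bucket_corners`, the `(x, y, response)` tuple layout, and the `cell_size` value are hypothetical and not taken from the notes. It also caps each cell at its five strongest corners, which is one simple reading of the note. A detector that guarantees a minimum per cell would instead have to relax its threshold in sparse cells.

```python
from collections import defaultdict


def bucket_corners(corners, cell_size, per_cell=5):
    """Keep the strongest `per_cell` corners in each square grid cell.

    corners:   iterable of (x, y, response) tuples (assumed layout).
    cell_size: side length of a grid cell in pixels (assumed value).
    """
    cells = defaultdict(list)
    for x, y, response in corners:
        # Integer grid coordinates of the cell that contains this corner.
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y, response))

    kept = []
    for members in cells.values():
        # Strongest responses first, then truncate to the per-cell budget.
        members.sort(key=lambda c: c[2], reverse=True)
        kept.extend(members[:per_cell])
    return kept
```

Calling `bucket_corners(corners, cell_size=32)` on a detector's output would thin out crowded cells while leaving sparse cells untouched.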

Compute orientation & ORB descriptors
- LSD-SLAM [10]: a novel, direct, semi-dense method
- its depth values take time to converge
- PTAM [4] used as a benchmark
- two keyframes selected manually for initialization

- align the keyframe trajectories using a similarity transformation
- a similarity (rather than rigid) alignment is needed because the scale is unknown
- measure the absolute trajectory error (ATE) [38]
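
The alignment and ATE steps above can be sketched with NumPy. This is a minimal sketch, assuming the estimated and ground-truth keyframe positions are already matched one-to-one as `(N, 3)` arrays. The function names `align_similarity` and `ate_rmse` are hypothetical, and the closed-form least-squares similarity fit (Umeyama's method) is a swapped-in choice, since the notes do not say which solver was used.

```python
import numpy as np


def align_similarity(est, gt):
    """Least-squares scale s, rotation R, translation t with gt ~ s * R @ est + t.

    est, gt: (N, 3) arrays of matched keyframe positions (assumed layout).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)        # variance of the estimate
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * (R @ mu_e)
    return s, R, t


def ate_rmse(est, gt):
    """Root-mean-square absolute trajectory error after similarity alignment."""
    s, R, t = align_similarity(est, gt)
    aligned = s * (est @ R.T) + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
```

Because the scale is estimated as part of the fit, this error reflects only the shape of the trajectory, not whether its metric scale is right.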

- RGB-D SLAM [43] trajectories used for comparison

- use a similarity transformation to check whether the scale is well recovered
- align the trajectories with a rigid body transformation
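
One way to picture the scale check is to align rigidly and then ask what single scale factor would best explain the remaining mismatch. The sketch below is an illustration under assumptions, not the procedure from the cited work: the names `rigid_align` and `scale_check` are hypothetical, and the inputs are assumed to be matched `(N, 3)` position arrays.

```python
import numpy as np


def rigid_align(est, gt):
    """Least-squares rotation R and translation t with gt ~ R @ est + t (no scale)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    U, _, Vt = np.linalg.svd((gt - mu_g).T @ (est - mu_e))
    # Flip the last axis if needed so that R is a proper rotation.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return R, mu_g - R @ mu_e


def scale_check(est, gt):
    """Return (rigid ATE, best-fit scale) for matched trajectories."""
    R, t = rigid_align(est, gt)
    rotated = (est - est.mean(axis=0)) @ R.T
    centred_gt = gt - gt.mean(axis=0)
    # Scale that best maps the rotated estimate onto the ground truth.
    scale = (rotated * centred_gt).sum() / (rotated ** 2).sum()
    residual = est @ R.T + t - gt
    rigid_ate = np.sqrt((residual ** 2).sum(axis=1).mean())
    return rigid_ate, scale
```

A best-fit scale close to 1 suggests the metric scale was recovered well, while a value away from 1 indicates a scale error that the rigid alignment cannot absorb.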



Artificial Intelligence and Machine Vision
