How to train a neural network to retrieve 3D maps from videos

This post explains how to train a neural network to extract depth maps from videos of moving people captured with a monocular camera.

Extracting the depth map of moving people with a monocular camera is difficult. The difficulty comes from motion blur and rolling shutter distortion in the images.

However, we can overcome these limitations by training a model on a dataset generated from the normalized videos with structure from motion (SfM) and multi-view stereo (MVS), and then using that model to predict the depth maps.

This normalized dataset can serve as the basis of the training set, so that the neural network learns to extract accurate depth maps from typical video footage automatically, without any further assistance from MVS.

To start this project with SfM and MVS, we will use the TUM dataset.

So, the basic idea is to use SfM and multi-view stereo to estimate depth, which serves as supervision during training.

The RGB-D SLAM reference implementation from these sources is used:
- RGB-D SLAM (Robot Operating System, ROS)
- Real-time 3D Visual SLAM with a hand-held RGB-D camera (2011)
- An Evaluation of the RGB-D SLAM System (2012)

1) Estimate camera pose
2) Compute dense depth with MVS
3) Select keyframes

Note: Calculate the L2 distance between the optical centers of the two frames in a pair. The two frames should also overlap: the ratio of shared features must be at least τ = 0.6.

So discard frame pairs whose ratio of shared features is below 0.6, using a frame window interval of 10.
In other words, the paired frame should lie within 10 neighboring frames on either side of time t.

4) Generate confidence map C
5) Calculate Losses
    note: use scale-invariant depth regression loss
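
The keyframe selection in step 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name select_source_frames, its inputs (per-frame optical centers and sets of visible feature ids), and the rule of picking the largest baseline among the surviving candidates are all assumptions made for the example. It also assumes the window of 10 limits how far apart the two frames of a pair may be.

```python
import numpy as np


def select_source_frames(centers, features, tau=0.6, window=10):
    """Pick one source frame per reference frame (hypothetical helper).

    centers:  (N, 3) camera optical centers, one row per frame.
    features: list of N sets holding the feature ids visible in each frame.
    Returns a dict mapping reference index -> source index, or None.
    """
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    pairs = {}
    for r in range(n):
        best, best_baseline = None, 0.0
        # Only look at frames inside the +/- window around the reference.
        for s in range(max(0, r - window), min(n, r + window + 1)):
            if s == r or not features[r]:
                continue
            # Fraction of the reference frame's features also seen in frame s.
            overlap = len(features[r] & features[s]) / len(features[r])
            if overlap < tau:
                continue  # too little shared content: discard this pair
            # L2 distance between the two optical centers.
            baseline = np.linalg.norm(centers[r] - centers[s])
            if baseline > best_baseline:
                best, best_baseline = s, baseline
        pairs[r] = best
    return pairs
```

A reference frame that has no candidate above the overlap threshold maps to None and can be dropped from the training set.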

Estimate Camera Pose:

1) Identify trackable sequences in each low resolution video using ORB-SLAM.
2) Estimate an initial camera pose for each frame using ORB-SLAM.
3) Re-process the high resolution videos using a visual SfM.
    a) Refine the initial camera poses and intrinsic parameters.
    b) Extract and match features across frames.
    c) Run a global bundle adjustment optimization.

4) Remove sequences with non-smooth camera motion.
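
Step 4 does not say how smoothness is measured. One possible check, shown purely as an assumption, is to threshold the second finite difference of the camera centers along the sequence. The name is_smooth_trajectory and the max_accel value are hypothetical and would need tuning to the scale of the reconstruction.

```python
import numpy as np


def is_smooth_trajectory(centers, max_accel=0.05):
    """Return True if the camera path has no abrupt jumps (assumed criterion).

    centers: (N, 3) camera optical centers in frame order.
    """
    centers = np.asarray(centers, dtype=float)
    if len(centers) < 3:
        return True  # too short to measure a change in velocity
    # Second finite difference approximates the per-frame acceleration.
    accel = centers[2:] - 2.0 * centers[1:-1] + centers[:-2]
    return bool(np.linalg.norm(accel, axis=1).max() <= max_accel)
```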

Compute Dense Depth Maps with MVS:

- reconstruct each scene's dense geometry using MVS.
- filter outlier depths using the depth refinement method.
- remove erroneous depth values.

   note: for each frame, compute a normalized error Δ(p) for every valid pixel p.  Depth values whose normalized error Δ(p) exceeds 0.2 should be removed.

- estimate optical flow between a reference image and a source image.
- compute an initial depth map from the estimated flow field.
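
The normalized-error filter from the note above can be sketched in NumPy. The post does not define Δ(p), so this example assumes it is the absolute difference between the MVS depth and a second depth estimate (for instance, the flow-derived depth), divided by their sum. The function name filter_depth_outliers and the convention that a depth of 0 marks an invalid pixel are also assumptions.

```python
import numpy as np


def filter_depth_outliers(depth_mvs, depth_ref, threshold=0.2):
    """Zero out MVS depths that disagree with a second depth estimate.

    Assumed error: delta(p) = |D_mvs - D_ref| / (D_mvs + D_ref).
    Returns the cleaned depth map and the boolean mask of kept pixels.
    """
    depth_mvs = np.asarray(depth_mvs, dtype=float)
    depth_ref = np.asarray(depth_ref, dtype=float)
    # A pixel is valid only where both estimates are positive.
    valid = (depth_mvs > 0) & (depth_ref > 0)
    error = np.zeros_like(depth_mvs)
    error[valid] = (np.abs(depth_mvs[valid] - depth_ref[valid])
                    / (depth_mvs[valid] + depth_ref[valid]))
    # Keep valid pixels whose normalized error stays within the threshold.
    keep = valid & (error <= threshold)
    return np.where(keep, depth_mvs, 0.0), keep
```

The returned mask can double as the set of pixels that contribute to the training loss.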

Using TUM RGBD Dataset

- Before running our model, we should estimate camera poses using ORB-SLAM2.
- Some footage has severe motion blur and rolling shutter effects, which cause incorrect pose estimation.
- Filter out these failures by inspecting the camera trajectory and sparse map.
- After training and validation, we can compare the model's depth prediction with NYU and DIW datasets.

Quantitative Comparisons

The predicted depth is compared with the ground-truth (GT) depth using scale-invariant error measures and RMSE.
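
Both measures can be sketched in NumPy. The exact formulas are not given in this post, so the log-space scale-invariant error below is an assumption based on the commonly used definition. Treating pixels with a ground-truth depth of 0 as invalid is also an assumption, and the function names are hypothetical.

```python
import numpy as np


def _valid_pairs(pred, gt):
    """Return predicted and GT depths at pixels with positive GT depth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mask = gt > 0
    return pred[mask], gt[mask]


def rmse(pred, gt):
    """Root mean squared error over valid pixels."""
    p, g = _valid_pairs(pred, gt)
    return float(np.sqrt(np.mean((p - g) ** 2)))


def scale_invariant_rmse(pred, gt):
    """Log-space RMSE that ignores a global scale factor on the prediction."""
    p, g = _valid_pairs(pred, gt)
    d = np.log(p) - np.log(g)
    # Subtracting the squared mean removes the effect of a constant scale.
    var = np.mean(d ** 2) - np.mean(d) ** 2
    return float(np.sqrt(max(var, 0.0)))
```

Multiplying the prediction by any positive constant leaves scale_invariant_rmse unchanged, which suits monocular depth where absolute scale is ambiguous. The squared form of the same expression is a common choice for a scale-invariant depth regression loss.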

The model is superior in terms of accuracy, although its RMSE is still around 0.377.  The rightmost column shows the depth map predicted by the model.

The code for the neural network model is open source and available at the following GitHub repository.

