
How to train a neural network to retrieve 3D maps from videos

This post explains how to train a neural network to extract depth maps from videos of moving people captured with a monocular camera.

Note:
Extracting the depth map of moving people with a monocular camera is difficult, because the images suffer from motion blur and rolling-shutter distortion.

However, we can overcome these limitations by predicting the depth maps with a model trained on a dataset generated from the normalized videos using Structure from Motion (SfM) and Multi-View Stereo (MVS).

This normalized dataset can serve as the basis of the training set, so that the neural network learns to extract accurate depth maps from typical video footage automatically, without any further assistance from an MVS.

To start this project with SfM and MVS, we will use the TUM dataset.

So, the basic idea is to use SfM and Multi-View Stereo to estimate depth, which serves as supervision during training.

The RGB-D SLAM reference implementation from these papers is used:
- RGB-D SLAM (ROS package)
- Real-time 3D Visual SLAM with a Hand-held RGB-D Camera (2011)
- An Evaluation of the RGB-D SLAM System (2012)


Steps:
1) Estimate camera pose
2) Compute dense depth with an MVS
3) Select keyframes

Note: Calculate the L2 distance between the optical centers of the two frames.  The two frames should also overlap by at least a threshold τ = 0.6, measured as the ratio of features they share.

So discard frame pairs whose shared-feature ratio is below 0.6, using a frame window of 10.
In other words, the paired frame is chosen from at most 10 neighboring frames on either side of time t.

4) Generate confidence map C
5) Calculate losses
    Note: use a scale-invariant depth regression loss.
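
The keyframe selection in step 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the post's actual implementation: the function names `covisibility` and `select_source_frame` are hypothetical, the overlap is assumed to be a Dice-style ratio of shared feature IDs, the source frame is assumed to lie within 10 frames of the reference frame, and among the surviving candidates the one with the largest L2 distance between optical centers is picked.

```python
import numpy as np

def covisibility(features_r, features_j):
    """Ratio of features shared by two frames (assumed Dice-style overlap)."""
    shared = len(features_r & features_j)
    total = len(features_r) + len(features_j)
    return 2.0 * shared / total if total else 0.0

def select_source_frame(r, centers, features, tau=0.6, max_interval=10):
    """Return the index of the source frame paired with reference frame r.

    centers:  (N, 3) array of camera optical centers, one per frame.
    features: list of N sets of feature IDs visible in each frame.
    Returns None when no neighbor passes the overlap threshold.
    """
    best, best_dist = None, -1.0
    lo = max(0, r - max_interval)
    hi = min(len(centers), r + max_interval + 1)
    for j in range(lo, hi):
        if j == r:
            continue
        if covisibility(features[r], features[j]) < tau:
            continue  # too little overlap: discard this pair
        dist = float(np.linalg.norm(centers[r] - centers[j]))
        if dist > best_dist:  # assumption: prefer the widest baseline
            best, best_dist = j, dist
    return best
```

Preferring the widest baseline among well-overlapping frames is a design choice of this sketch: a larger camera displacement generally gives more parallax, while the overlap threshold keeps the pair matchable.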



Estimate Camera Pose:

1) Identify trackable sequences in each low-resolution video using ORB-SLAM.
2) Estimate an initial camera pose for each frame with ORB-SLAM.
3) Re-process the high-resolution videos using visual SfM.

    Note: this re-processing
    a) refines the initial camera poses and intrinsic parameters,
    b) extracts and matches features across frames,
    c) runs a global bundle adjustment optimization.

4) Remove sequences with non-smooth camera motion.
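
The post does not say how non-smooth camera motion is detected. One possible heuristic, assumed here purely for illustration, is to flag frames whose optical center jumps much farther than the typical frame-to-frame step. The function name `non_smooth_frames` and the `ratio` threshold are hypothetical.

```python
import numpy as np

def non_smooth_frames(centers, ratio=5.0):
    """Flag frames whose step from the previous frame is far above the median step.

    centers: (N, 3) sequence of camera optical centers in frame order.
    ratio:   assumed threshold; a step larger than ratio * median is flagged.
    Returns a boolean array of length N (the first frame is never flagged).
    """
    centers = np.asarray(centers, dtype=float)
    flags = np.zeros(len(centers), dtype=bool)
    if len(centers) < 2:
        return flags
    steps = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    median = np.median(steps)
    if median == 0:
        return flags  # static camera: nothing to compare against
    flags[1:] = steps > ratio * median
    return flags
```

A sequence could then be dropped, or split at the flagged frames, depending on how many jumps it contains.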
 



Compute Dense Depth Maps with an MVS:

- Reconstruct each scene's dense geometry using an MVS.
   (e.g. https://colmap.github.io/)
- Filter outlier depths using the depth refinement method.
- Remove erroneous depth values.

   Note: for each frame, compute a normalized error Δ(p) for every valid pixel p.  Pixels with Δ(p) > 0.2 should be removed.

- Estimate the optical flow between a reference image and a source image.
- Compute an initial depth map from the estimated flow field.
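
The removal of erroneous depth values can be sketched as below. The post does not spell out the formula for the normalized error, so this sketch assumes it is the absolute difference between the MVS depth and a second depth estimate (for example the flow-based initial depth), divided by their sum. The function name `filter_depth` is hypothetical, and a depth of 0 is assumed to mark an invalid pixel.

```python
import numpy as np

def filter_depth(depth_mvs, depth_flow, max_error=0.2):
    """Remove MVS depth values that disagree with a second depth estimate.

    Assumed normalized error: |D_mvs - D_flow| / (D_mvs + D_flow).
    Returns the filtered depth map (0 where removed) and the keep mask.
    """
    depth_mvs = np.asarray(depth_mvs, dtype=float)
    depth_flow = np.asarray(depth_flow, dtype=float)
    # Only pixels with a positive depth in both maps are considered valid.
    valid = (depth_mvs > 0) & (depth_flow > 0)
    error = np.zeros_like(depth_mvs)
    error[valid] = (np.abs(depth_mvs[valid] - depth_flow[valid])
                    / (depth_mvs[valid] + depth_flow[valid]))
    keep = valid & (error <= max_error)
    return np.where(keep, depth_mvs, 0.0), keep
```

Because the error is divided by the sum of the two depths, it always lies between 0 and 1, so the 0.2 threshold is independent of the scene scale.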


Using the TUM RGB-D Dataset

- Before running our model, we should estimate camera poses using ORB-SLAM2.
- Some footage has severe motion blur and rolling-shutter effects.  This causes incorrect pose estimation.
- Filter out these failures by inspecting the camera trajectory and the sparse map.
- After training and validation, we can compare the model's depth predictions against the NYU and DIW datasets.
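
To inspect a camera trajectory, it helps to load it into an array first. The sketch below assumes the trajectory is stored in the TUM text format, where each pose is one row of `timestamp tx ty tz qx qy qz qw` and comment lines start with `#`. The function names `parse_tum_trajectory` and `path_length` are hypothetical helpers, not part of ORB-SLAM2 or the TUM tools.

```python
import numpy as np

def parse_tum_trajectory(text):
    """Parse TUM-format trajectory text into timestamps and camera positions."""
    timestamps, positions = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 8:
            continue  # skip malformed rows
        values = [float(f) for f in fields]
        timestamps.append(values[0])
        positions.append(values[1:4])  # tx, ty, tz
    return np.array(timestamps), np.array(positions, dtype=float).reshape(-1, 3)

def path_length(positions):
    """Total distance travelled by the camera along the trajectory."""
    if len(positions) < 2:
        return 0.0
    return float(np.linalg.norm(np.diff(positions, axis=0), axis=1).sum())
```

A very short path, or one with sudden large steps, is a hint that tracking failed on that sequence and that it should be looked at before being used.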

Quantitative Comparisons

The model's predicted depth is compared with the ground-truth (GT) depth using scale-invariant error measures and RMSE.

The model is superior in terms of accuracy, although its RMSE is still around 0.377.  The rightmost column shows the depth maps predicted by the model.
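
For reference, the two error measures can be sketched as follows. The post does not give the exact formulas, so this assumes the plain RMSE over valid pixels and one common form of the scale-invariant error: the RMSE of the log-depth difference after removing its mean. The function name `depth_errors` is hypothetical, and a depth of 0 is assumed to mark an invalid pixel.

```python
import numpy as np

def depth_errors(pred, gt):
    """Return (RMSE, scale-invariant RMSE in log space) over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = (gt > 0) & (pred > 0)
    if not valid.any():
        raise ValueError("no valid pixels to compare")
    p, g = pred[valid], gt[valid]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    # Subtracting the mean log difference cancels any global scale factor.
    d = np.log(p) - np.log(g)
    si_rmse = float(np.sqrt(np.mean((d - d.mean()) ** 2)))
    return rmse, si_rmse
```

A prediction that is correct up to a global scale, for example twice the GT depth everywhere, has a large RMSE but a scale-invariant error of essentially zero, which is why both measures are reported.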









Code
The code for the neural network model is open source and available in the following GitHub repository.

https://github.com/dparksports



