
How to train a neural network to retrieve 3D maps from videos

This post is about how to train a neural network to extract depth maps from videos of moving people captured with a monocular camera.

Note:
With a monocular camera, extracting the depth map of moving people is difficult. The difficulty comes from motion blur and the rolling shutter in the images.

However, we can overcome these limitations by training a model on a dataset generated with Structure-from-Motion (SfM) and Multi-View Stereo (MVS) from normalized videos, and then using that model to predict the depth maps.

This normalized dataset becomes the training set for the neural network, which can then extract accurate depth maps from typical video footage without any further assistance from MVS.

To start this project with SfM and MVS, we will use the TUM RGB-D dataset.

So the basic idea is to use SfM and Multi-View Stereo to estimate depth, which then serves as supervision during training.

The RGB-D SLAM reference implementations from these papers are used:
- RGB-D SLAM (Robot Operating System, ROS)
- Real-time 3D Visual SLAM with a Hand-held RGB-D Camera (2011)
- An Evaluation of the RGB-D SLAM System (2012)


Steps:
1) Estimate camera poses.
2) Compute dense depth with MVS.
3) Select keyframes.

Note: Compute the L2 distance between the optical centers of the two frames. A candidate pair should overlap by at least the threshold τ, i.e., share at least a 0.6 ratio of features.

So discard frame pairs whose feature-overlap ratio is below 0.6, and enforce a frame-window interval of 10: the two frames of a pair should not lie within 10 neighboring frames of each other around time t. (A minimal sketch of this pair test appears after this list.)

4) Generate a confidence map C.
5) Calculate the losses.
    Note: use a scale-invariant depth regression loss.
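
For reference, here is a minimal sketch of a scale-invariant depth regression loss in the style of Eigen et al. [6], written in PyTorch. The function name, the lambda weight of 0.5, and the masking convention are assumptions for illustration, not necessarily the exact loss used by the model.

    import torch

    def scale_invariant_loss(pred_depth, gt_depth, mask, lam=0.5, eps=1e-8):
        # Scale-invariant depth regression loss (Eigen et al. style).
        # pred_depth, gt_depth: (B, H, W) tensors of positive depths.
        # mask: (B, H, W) tensor marking pixels with valid MVS depth.
        mask = mask.float()
        # Work in log-depth so the loss is invariant to a global scale factor.
        d = torch.log(pred_depth.clamp(min=eps)) - torch.log(gt_depth.clamp(min=eps))
        d = d * mask                                   # zero out invalid pixels
        n = mask.sum(dim=(1, 2)).clamp(min=1.0)

        mse_term = (d ** 2).sum(dim=(1, 2)) / n
        scale_term = (d.sum(dim=(1, 2)) ** 2) / (n ** 2)
        return (mse_term - lam * scale_term).mean()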


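Following the keyframe note in step 3, the sketch below shows one plausible way to implement the pair test: compute the L2 distance between the optical centers of two camera poses, check the feature-overlap ratio against the 0.6 threshold, and enforce the 10-frame separation window. The 4x4 camera-to-world pose format, the helper names, and the minimum-baseline value are assumptions for illustration.

    import numpy as np

    def optical_center(cam_to_world):
        # Camera optical center in world coordinates from a 4x4 camera-to-world pose.
        return cam_to_world[:3, 3]

    def is_valid_pair(pose_i, pose_j, overlap_ratio, i, j,
                      min_baseline=0.0, tau=0.6, window=10):
        # overlap_ratio: fraction of features in frame i matched in frame j.
        # tau=0.6 and window=10 follow the note above; min_baseline is a placeholder.
        baseline = np.linalg.norm(optical_center(pose_i) - optical_center(pose_j))
        if baseline <= min_baseline:       # require some camera translation
            return False
        if overlap_ratio < tau:            # too little shared content between the frames
            return False
        if abs(i - j) < window:            # frames of a pair must be at least 10 frames apart
            return False
        return True
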

Estimate Camera Pose:

1) Identify trackable sequences in each low-resolution video using ORB-SLAM.
2) Estimate an initial camera pose for each frame with ORB-SLAM.
3) Re-process the high-resolution videos using a visual SfM pipeline.

    Note:
    a) This reprocessing refines the initial camera poses and the intrinsic parameters.
    b) It extracts and matches features across frames.
    c) It runs a global bundle adjustment optimization.

4) Remove sequences with non-smooth camera motion (see the sketch below).
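
A hedged sketch of step 4: one simple way to flag non-smooth camera motion is to threshold the frame-to-frame acceleration of the optical centers recovered by SfM. The acceleration criterion and the threshold value are assumptions for illustration; the actual filtering rule may differ.

    import numpy as np

    def smooth_motion_mask(centers, max_accel=0.05):
        # centers: (N, 3) array of camera optical centers, one per frame,
        #          recovered by ORB-SLAM / SfM.
        # max_accel: assumed threshold on the second difference of the trajectory.
        # Returns a boolean array of length N, True where motion looks smooth.
        velocity = np.diff(centers, axis=0)        # (N-1, 3)
        accel = np.diff(velocity, axis=0)          # (N-2, 3)
        accel_norm = np.linalg.norm(accel, axis=1)

        mask = np.ones(len(centers), dtype=bool)
        # A large second difference marks a jerk in the trajectory.
        mask[1:-1] = accel_norm <= max_accel
        return mask

Sequences containing many flagged frames can then be dropped from the training set.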
 



Compute Dense Depth Maps with MVS:

- Reconstruct each scene's dense geometry using an MVS pipeline
   (e.g. COLMAP, https://colmap.github.io/).
- Filter outlier depths using a depth refinement method.
- Remove erroneous depth values.

   Note: for each frame, compute a normalized error Δ(p) for every valid pixel p. Pixels whose normalized error Δ(p) exceeds σ = 0.2 should be removed (see the sketch after this list).

- Estimate optical flow between a reference image and a source image.
- Compute an initial depth map from the estimated flow field.
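
To make the depth-filtering note concrete, the sketch below masks out pixels whose normalized error exceeds 0.2 by comparing the MVS depth against the initial depth computed from the flow field. The exact definition of Δ(p) is not spelled out above, so the relative difference used here is an assumption.

    import numpy as np

    def filter_mvs_depth(mvs_depth, flow_depth, sigma=0.2, eps=1e-8):
        # mvs_depth:  (H, W) dense depth from the MVS reconstruction.
        # flow_depth: (H, W) initial depth estimated from the optical-flow field.
        # Returns the filtered depth map and the validity mask.
        valid = (mvs_depth > 0) & (flow_depth > 0)   # zero depth means "no estimate"
        # Assumed normalized error: relative disagreement between the two estimates.
        delta = np.abs(mvs_depth - flow_depth) / np.maximum(flow_depth, eps)
        keep = valid & (delta <= sigma)

        filtered = np.where(keep, mvs_depth, 0.0)
        return filtered, keep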


Using the TUM RGB-D Dataset

- Before running our model, we estimate camera poses using ORB-SLAM2.
- Some footage has severe motion blur and rolling-shutter effects, which leads to incorrect pose estimates.
- Filter out these failures by inspecting the camera trajectory and the sparse map.
- After training and validation, we can compare the model's depth predictions against the NYU and DIW datasets.

Quantitative Comparisons

Using scale-invariant error measures and RMSE, we compare the ground-truth depth with the depth predicted by the model.

The model is superior in terms of accuracy, though the RMSE remains around 0.377. The rightmost column shows the depth map predicted by the model.
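
For reference, here is a minimal sketch of the two error measures used above, RMSE and the scale-invariant error, computed over valid ground-truth pixels. The log-space scale-invariant formulation follows Eigen et al. [6]; treating zero ground-truth depth as invalid is an assumption.

    import numpy as np

    def depth_metrics(pred, gt, eps=1e-8):
        # Compute RMSE and scale-invariant error between predicted and GT depth.
        valid = gt > 0                      # assume zero marks missing GT depth
        p, g = pred[valid], gt[valid]

        rmse = np.sqrt(np.mean((p - g) ** 2))

        d = np.log(np.maximum(p, eps)) - np.log(np.maximum(g, eps))
        scale_inv = np.mean(d ** 2) - np.mean(d) ** 2   # scale-invariant (log) error

        return {"rmse": rmse, "scale_invariant": scale_inv}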









Code
The code for the neural network model is open-sourced at the following GitHub repository.

https://github.com/dparksports



References
[1] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[2] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In Int. Conf. on 3D Vision (3DV), 2017.
[3] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In Neural Information Processing Systems, pages 730–738, 2016.
[4] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[5] M. Dou, S. Khamis, Y. Degtyarev, P. L. Davidson, S. R. Fanello, A. Kowdle, S. Orts, C. Rhemann, D. Kim, J. Taylor, P. Kohli, V. Tankovich, and S. Izadi. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graphics, 35:114:1–114:13, 2016.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, pages 2366–2374, 2014.
[8] C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[10] I. P. Howard. Seeing in Depth, Vol. 1: Basic Mechanisms. University of Toronto Press, 2002.
[11] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[12] M. Innmann, M. Zollhöfer, M. Niessner, C. Theobalt, and M. Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[13] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. European Conf. on Computer Vision (ECCV), pages 17–30, 1996.
[14] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. on 3D Vision (3DV), pages 239–248, 2016.
[15] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the People: Closing the loop between 3D and 2D human representations. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[16] O. Mees, A. Eitel, and W. Burgard. Choosing Smartly: Adaptive multimodal fusion for object detection in changing environments. In Int. Conf. on Intelligent Robots and Systems (IROS), 2016.
[17] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graphics, 36:44:1–44:14, 2017.
[18] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[19] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[20] B. Ni, G. Wang, and P. Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proc. ICCV Workshops, 2011.
[21] H. S. Park, T. Shiratori, I. A. Matthews, and Y. Sheikh. 3D reconstruction of a moving point from a series of 2D projections. In Proc. European Conf. on Computer Vision (ECCV), 2010.
[22] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[23] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[24] C. Russell, R. Yu, and L. Agapito. Video Pop-up: Monocular 3D reconstruction of dynamic scenes. In Proc. European Conf. on Computer Vision (ECCV), pages 583–598, 2014.
[25] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[26] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conf. on Computer Vision (ECCV), pages 501–518, 2016.
[27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[29] T. Simon, J. Valmadre, I. A. Matthews, and Y. Sheikh. Kronecker-Markov prior for dynamic 3D reconstruction. Trans. Pattern Analysis and Machine Intelligence, 39:2201–2214, 2017.
[30] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[31] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 573–580, 2012.
[32] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[33] M. Vo, S. G. Narasimhan, and Y. Sheikh. Spatiotemporal bundle adjustment for dynamic 3D reconstruction. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[34] J. Wulff, L. Sevilla-Lara, and M. J. Black. Optical flow in mostly rigid scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[35] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1625–1632, 2013.
[36] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proc. Computer Vision and Pattern Recognition (CVPR), 2014.
[37] E. Zheng, D. Ji, E. Dunn, and J.-M. Frahm. Sparse dynamic 3D reconstruction from unsynchronized videos. In Proc. Int. Conf. on Computer Vision (ICCV), pages 4435–4443, 2015.
[38] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[39] Y. Zhu, W. Chen, and G. Guo. Evaluating spatiotemporal interest point features for depth-based action recognition. Image and Vision Computing, 32(8):453–464, 2014.
[40] M. Zollhöfer, M. Niessner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, et al. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graphics, 33(4):156, 2014.

