This blog post describes how to train a neural network to extract depth maps from videos of moving people captured with a monocular camera.
Note:
Extracting depth maps of moving people with a monocular camera is difficult, largely because of motion blur and rolling-shutter artifacts in the images.
However, we can overcome these limitations by training a model on a dataset generated with SfM and MVS from the normalized videos, and then predicting depth maps with that model.
This generated dataset can serve as the basis of the training set for a neural network that automatically extracts accurate depth maps from typical video footage, without any further assistance from MVS.
To bootstrap this project with SfM and MVS, we will use the TUM RGB-D dataset.
So the basic idea is to use SfM and multi-view stereo (MVS) to estimate depth, which serves as supervision during training.
The RGB-D SLAM reference implementations from these papers are used:
- RGB-D SLAM (ROS package)
- Real-time 3D Visual SLAM with a Hand-held RGB-D Camera (2011)
- An Evaluation of the RGB-D SLAM System (2012)
Steps:
1) Estimate camera poses.
2) Compute dense depth with MVS.
3) Select keyframes.
Note: compute the L2 distance between the optical centers of each candidate frame pair. A pair is kept only if it has sufficient baseline and its frames share at least a 0.6 ratio of matched features (the overlap threshold τ).
Discard pairs whose feature overlap falls below 0.6, and do not pair the frame at time t with any of its 10 immediate temporal neighbors, so that each pair has enough parallax (see the selection sketch after this list).
4) Generate a confidence map C.
5) Compute losses.
note: use a scale-invariant depth regression loss (see the loss sketch after this list).
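As a rough illustration of the keyframe selection in step 3, here is a minimal Python sketch. The baseline threshold, the `overlap_ratio` callback, and the function names are hypothetical; only the 0.6 feature-overlap ratio and the 10-frame window come from the notes above.

```python
import numpy as np

def select_frame_pairs(optical_centers, overlap_ratio, window=10,
                       min_baseline=0.03, min_overlap=0.6):
    """Sketch of the keyframe pair selection described in step 3.

    optical_centers: (N, 3) array of per-frame camera optical centers.
    overlap_ratio:   callable (t, s) -> fraction of features shared by
                     frames t and s (assumed to come from the SfM tracks).
    min_baseline and window are illustrative assumptions.
    """
    n = len(optical_centers)
    pairs = []
    for t in range(n):
        for s in range(n):
            if abs(s - t) <= window:            # skip the 10 immediate neighbors of t
                continue
            baseline = np.linalg.norm(optical_centers[t] - optical_centers[s])
            if baseline < min_baseline:         # require sufficient camera motion
                continue
            if overlap_ratio(t, s) < min_overlap:  # require >= 0.6 shared features
                continue
            pairs.append((t, s))
    return pairs
```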
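For step 5, a minimal sketch of a scale-invariant depth regression loss in log-depth space (in the spirit of Eigen et al. [6]) is shown below, assuming PyTorch. The weighting factor `lam` and the masking convention are assumptions, not values taken from this post.

```python
import torch

def scale_invariant_loss(pred_depth, gt_depth, mask, lam=0.5):
    """Scale-invariant depth regression loss over valid pixels only.

    pred_depth, gt_depth: (B, H, W) tensors of positive depths.
    mask: (B, H, W) tensor marking pixels with valid, confident MVS depth.
    lam is an assumed weighting factor for the scale-invariance term.
    """
    eps = 1e-6
    mask = mask.float()
    d = (torch.log(pred_depth + eps) - torch.log(gt_depth + eps)) * mask
    n = mask.sum(dim=(1, 2)).clamp(min=1.0)          # valid pixels per image
    term1 = (d ** 2).sum(dim=(1, 2)) / n             # mean squared log difference
    term2 = lam * d.sum(dim=(1, 2)) ** 2 / n ** 2    # scale-invariance correction
    return (term1 - term2).mean()
```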
Estimate Camera Pose:
1) Identify trackable sequences in each low-resolution video using ORB-SLAM.
2) Estimate an initial camera pose for each frame with ORB-SLAM.
3) Re-process the high-resolution videos with a visual SfM pipeline.
note:
a) This reprocessing refines the initial camera poses and intrinsic parameters,
b) extracts and matches features across frames, and
c) runs a global bundle adjustment optimization.
4) Remove sequences with non-smooth camera motion (see the smoothness-check sketch after this list).
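As a rough illustration of step 4, the sketch below flags sequences with non-smooth camera motion by thresholding frame-to-frame translation and rotation. The thresholds and the pose representation are assumptions; the post does not specify how smoothness is measured.

```python
import numpy as np

def is_smooth_sequence(rotations, centers, max_trans=0.05, max_rot_deg=5.0):
    """Heuristic smoothness check for an estimated camera trajectory.

    rotations: (N, 3, 3) per-frame rotation matrices.
    centers:   (N, 3) per-frame camera optical centers.
    Returns False if any consecutive pair of frames translates or rotates
    more than the (assumed) thresholds, which usually indicates a failure.
    """
    for i in range(1, len(centers)):
        trans = np.linalg.norm(centers[i] - centers[i - 1])
        r_rel = rotations[i] @ rotations[i - 1].T     # relative rotation
        cos_angle = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
        angle = np.degrees(np.arccos(cos_angle))
        if trans > max_trans or angle > max_rot_deg:
            return False
    return True
```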
Compute Dense Depth Maps with MVS:
- Reconstruct each scene's dense geometry with an MVS system (e.g. COLMAP: https://colmap.github.io/).
- Filter outlier depths with a depth refinement step.
- Remove erroneous depth values.
note: for each frame, compute a normalized error Δ(p) for every valid pixel p; pixels with Δ(p) > 0.2 should be removed (see the filtering sketch after this list).
- Estimate optical flow between a reference image and a source image.
- Compute an initial depth map from the estimated flow field.
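The sketch below illustrates the depth-filtering note above. The exact definition of Δ(p) is not given in this post, so the code assumes it compares the MVS depth against the depth computed from optical flow, normalized by their sum; only the 0.2 threshold comes from the note.

```python
import numpy as np

def filter_mvs_depth(depth_mvs, depth_flow, valid, thresh=0.2):
    """Remove unreliable MVS depths using an (assumed) normalized error Delta(p).

    depth_mvs, depth_flow: (H, W) depth maps from MVS and from optical flow.
    valid: (H, W) boolean mask of pixels that have a depth estimate at all.
    """
    denom = depth_mvs + depth_flow
    ok = valid & (denom > 0)
    delta = np.zeros_like(depth_mvs)
    delta[ok] = np.abs(depth_mvs[ok] - depth_flow[ok]) / denom[ok]
    keep = ok & (delta <= thresh)                 # drop pixels with Delta(p) > 0.2
    filtered = np.where(keep, depth_mvs, 0.0)     # 0 marks an invalid depth
    return filtered, keep
```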
Using the TUM RGB-D Dataset
- Before training our model, we estimate camera poses using ORB-SLAM2.
- Some sequences have severe motion blur and rolling-shutter effects, which cause incorrect pose estimation.
- Filter out these failures by inspecting the camera trajectory and the sparse map (see the trajectory-inspection sketch after this list).
- After training and validation, we can compare the model's depth predictions on the NYU and DIW datasets.
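A small sketch of the trajectory inspection is shown below. It loads a trajectory in the TUM format that ORB-SLAM2 writes (timestamp tx ty tz qx qy qz qw) and flags large positional jumps or long time gaps; both thresholds are illustrative assumptions.

```python
import numpy as np

def load_tum_trajectory(path):
    """Load a TUM-format trajectory: timestamp tx ty tz qx qy qz qw per line."""
    data = np.loadtxt(path, comments="#")
    return data[:, 0], data[:, 1:4], data[:, 4:8]  # timestamps, positions, quaternions

def flag_tracking_failures(timestamps, positions, max_jump=0.10, max_gap=0.5):
    """Flag suspicious spots: large jumps between consecutive poses (meters)
    or long gaps in time (seconds), which often mean tracking was lost."""
    suspects = []
    for i in range(1, len(timestamps)):
        jump = np.linalg.norm(positions[i] - positions[i - 1])
        gap = timestamps[i] - timestamps[i - 1]
        if jump > max_jump or gap > max_gap:
            suspects.append((timestamps[i - 1], timestamps[i], jump, gap))
    return suspects
```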
Quantitative Comparisons
The predicted depth is compared against the ground-truth (GT) depth using scale-invariant error measures and RMSE.
The model is superior in terms of accuracy, although its RMSE is still around 0.377. The rightmost column shows the depth map predicted by the model (see the metrics sketch below).
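For reference, the sketch below shows how RMSE and a scale-invariant (log-space) error could be computed between ground-truth and predicted depth over valid pixels. The exact metric definitions behind the reported 0.377 RMSE are not spelled out here, so treat this as an assumed formulation.

```python
import numpy as np

def depth_metrics(pred, gt, mask):
    """RMSE and scale-invariant log error over valid (positive-depth) pixels."""
    p, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    d = np.log(p) - np.log(g)
    si = np.mean(d ** 2) - np.mean(d) ** 2   # scale-invariant error (log space)
    return rmse, si
```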
Code
The code for the neural network model is open-sourced at the following GitHub repository.
https://github.com/dparksports
References
[1] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[2] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Int. Conf. on 3D Vision (3DV), 2017.
[3] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In Neural Information Processing Systems, pages 730–738, 2016.
[4] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[5] M. Dou, S. Khamis, Y. Degtyarev, P. L. Davidson, S. R. Fanello, A. Kowdle, S. Orts, C. Rhemann, D. Kim, J. Taylor, P. Kohli, V. Tankovich, and S. Izadi. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graphics, 35:114:1–114:13, 2016.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, pages 2366–2374, 2014.
[8] C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[10] I. P. Howard. Seeing in Depth, Vol. 1: Basic Mechanisms. University of Toronto Press, 2002.
[11] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[12] M. Innmann, M. Zollhöfer, M. Niessner, C. Theobalt, and M. Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. European Conf. on Computer Vision (ECCV), 2016.
[13] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. European Conf. on Computer Vision (ECCV), pages 17–30, 1996.
[14] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. on 3D Vision (3DV), pages 239–248, 2016.
[15] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[16] O. Mees, A. Eitel, and W. Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In Int. Conf. on Intelligent Robots and Systems (IROS), 2016.
[17] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graphics, 36:44:1–44:14, 2017.
[18] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[19] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
[20] B. Ni, G. Wang, and P. Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proc. ICCV Workshops, 2011.
[21] H. S. Park, T. Shiratori, I. A. Matthews, and Y. Sheikh. 3D reconstruction of a moving point from a series of 2D projections. In Proc. European Conf. on Computer Vision (ECCV), 2010.
[22] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[23] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[24] C. Russell, R. Yu, and L. Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In Proc. European Conf. on Computer Vision (ECCV), pages 583–598, 2014.
[25] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[26] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conf. on Computer Vision (ECCV), pages 501–518, 2016.
[27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), 2012.
[29] T. Simon, J. Valmadre, I. A. Matthews, and Y. Sheikh. Kronecker-Markov prior for dynamic 3D reconstruction. Trans. Pattern Analysis and Machine Intelligence, 39:2201–2214, 2017.
[30] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[31] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 573–580, 2012.
[32] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[33] M. Vo, S. G. Narasimhan, and Y. Sheikh. Spatiotemporal bundle adjustment for dynamic 3D reconstruction. Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[34] J. Wulff, L. Sevilla-Lara, and M. J. Black. Optical flow in mostly rigid scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[35] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proc. Int. Conf. on Computer Vision (ICCV), pages 1625–1632, 2013.
[36] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proc. Computer Vision and Pattern Recognition (CVPR), 2014.
[37] E. Zheng, D. Ji, E. Dunn, and J.-M. Frahm. Sparse dynamic 3D reconstruction from unsynchronized videos. Proc. Int. Conf. on Computer Vision (ICCV), pages 4435–4443, 2015.
[38] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
[39] Y. Zhu, W. Chen, and G. Guo. Evaluating spatiotemporal interest point features for depth-based action recognition. Image and Vision Computing, 32(8):453–464, 2014.
[40] M. Zollhöfer, M. Niessner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, et al. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graphics, 33(4):156, 2014.