MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

1USTC, 2Microsoft Research, 3Harvard, 4Tsinghua University

MoGe turns single 2D images into 3D point maps.

Abstract

We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to the true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss that promotes precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In a comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point maps, depth maps, and camera field of view.
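To illustrate what affine-invariant (scale/shift-agnostic) alignment means, here is a minimal least-squares sketch that recovers the optimal global scale and 3D shift mapping a predicted point cloud onto a reference one. This is only a plain L2 closed form for intuition; the paper's actual solver is robust and truncated, and its exact formulation is not reproduced here.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Closed-form scale s and shift t minimizing ||s * pred + t - gt||^2.

    pred, gt: (N, 3) arrays of corresponding 3D points.
    A plain least-squares sketch of scale/shift-invariant alignment;
    not the paper's robust solver.
    """
    pred_mean = pred.mean(axis=0)
    gt_mean = gt.mean(axis=0)
    pc = pred - pred_mean          # centered prediction
    gc = gt - gt_mean              # centered reference
    # Optimal shared scale across all axes, then the shift that
    # matches the centroids under that scale.
    s = (pc * gc).sum() / (pc * pc).sum()
    t = gt_mean - s * pred_mean
    return s, t
```

Because the representation is agnostic to scale and shift, a prediction that differs from the ground truth only by such an affine factor incurs no penalty once aligned this way.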

teaser image

Results

Click on the images below to see our point map results as meshes in a 3D viewer.

πŸ’‘Tips

● Scroll to zoom in/out

● Drag to rotate

● Press "shift" and drag to pan

● Click on the buttons at the top to switch texture color on/off

Results on Videos

We predict point maps for the individual video frames and register them using only rigid (similarity) transformations computed from image matches (PDCNet).
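The registration step above can be sketched with the standard Umeyama similarity alignment, which finds the least-squares scale, rotation, and translation between two sets of corresponding 3D points. The correspondences are assumed given (the page obtains them via PDCNet image matching); this is an illustrative implementation, not the project's exact pipeline.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) so that
    dst ≈ s * src @ R.T + t, with src, dst as (N, 3) point arrays.

    Standard Umeyama alignment; correspondences are assumed known.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)            # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # guard against reflections
    R = U @ S @ Vt
    var_src = (sc ** 2).sum() / len(src)  # variance of source points
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying the recovered transform to each frame's point map places all frames in a common coordinate system, with no per-point deformation.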

Comparison to Other Methods

Select a method from the dropdown menu to compare the results of MoGe with it side by side.


*No camera intrinsics prediction; using ours instead.

More Uncurated Results and Comparisons

Check how our method compares to other methods on uncurated images (sourced from the first 100 images in DIV2K).

Columns: Original Image | Ours | LeReS | DUSt3R | UniDepth | Metric3D V2* | Depth Anything V2*

*No camera intrinsics prediction; using ours instead.

BibTeX

@misc{wang2024moge,
    title={MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision},
    author={Wang, Ruicheng and Xu, Sicheng and Dai, Cassie and Xiang, Jianfeng and Deng, Yu and Tong, Xin and Yang, Jiaolong},
    year={2024},
    eprint={2410.19115},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.19115}, 
}