CVPR 2024
MVD-Fusion predicts multi-view RGB-D images from a single input image, allowing the output to be converted to multiple 3D representations such as point clouds, textured meshes, and Gaussian splats.
We introduce MVD-Fusion, a method for single-view 3D inference via generative modeling of multi-view RGB-D images. While recent methods for 3D generation use novel-view generative models, they often require a lengthy distillation process to produce a 3D output. Instead, we cast 3D inference as directly generating multiple consistent RGB-D views, building on the insight that inferring depth in addition to RGB provides a mechanism for enforcing 3D consistency. Our method can be trained on both synthetic data (Objaverse) and real-world data (CO3D). We demonstrate the generalization of our model on both real and generated in-the-wild images.
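To illustrate how predicted depth enables a direct 3D output, the minimal sketch below unprojects each generated RGB-D view into a shared world-space colored point cloud using a standard pinhole camera model. This is only an illustrative sketch, not the paper's implementation; the function names (`unproject_rgbd`, `fuse_views`) and the assumption of known intrinsics and camera-to-world poses for each view are ours.

```python
import numpy as np

def unproject_rgbd(rgb, depth, K, cam2world):
    """Lift one RGB-D view into world-space colored points (pinhole model).

    rgb: (H, W, 3), depth: (H, W) z-depth, K: (3, 3) intrinsics,
    cam2world: (4, 4) camera-to-world pose. All assumed known here.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                       # camera-frame ray directions
    pts_cam = rays * depth[..., None]                     # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((h, w, 1))], axis=-1)
    pts_world = (pts_h @ cam2world.T)[..., :3]            # transform to world frame
    return pts_world.reshape(-1, 3), rgb.reshape(-1, 3)

def fuse_views(rgbs, depths, Ks, cam2worlds):
    """Fuse all generated RGB-D views into one colored point cloud."""
    pts, cols = zip(*[unproject_rgbd(r, d, K, T)
                      for r, d, K, T in zip(rgbs, depths, Ks, cam2worlds)])
    return np.concatenate(pts), np.concatenate(cols)
```

A fused point cloud of this kind can then be meshed or converted to other representations; consistent depth across views is what keeps the per-view unprojections aligned in a single world frame.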
We compare our method against Zero-1-to-3 and SyncDreamer on the Objaverse test set. Our results are more consistent than those of Zero-1-to-3 and more realistic than those of SyncDreamer.
We compare our method with Zero-1-to-3 and SyncDreamer, visualizing the input image, the multi-view images generated by each method, and the ground truth.
We show full RGB-D results along with our predicted low-resolution depth maps.
If you find this work useful in your research, please consider citing the following.
@inproceedings{hu2024mvdfusion,
title={MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation},
author={Hanzhe Hu and Zhizhuo Zhou and Varun Jampani and Shubham Tulsiani},
booktitle={CVPR},
year={2024}
}
We thank Bharath Raj, Jason Y. Zhang, Yufei (Judy) Ye, Yanbo Xu, and Zifan Shi for helpful discussions and feedback. This work is supported in part by NSF GRFP grants DGE1745016 and DGE2140739.