CVPR 2024
MVD-Fusion predicts multi-view RGB-D images from a single input image, allowing the output to be converted to multiple 3D representations such as point clouds, textured meshes, and Gaussian splats.
We introduce MVD-Fusion, a method for single-view 3D inference via generative modeling of multi-view RGB-D images. While recent methods for 3D generation use novel-view generative models, they often require a lengthy distillation process to produce a 3D output. Instead, we cast 3D inference as directly generating multiple consistent RGB-D views, building on the insight that inferring depth in addition to RGB provides a mechanism for enforcing 3D consistency. Our method can be trained on both synthetic data (Objaverse) and real-world data (CO3D). We demonstrate the generalization of our model on both real and generated in-the-wild images.
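To illustrate how predicted depth enables a direct 3D output, the minimal sketch below unprojects each generated RGB-D view into a shared world-space colored point cloud using a standard pinhole camera model. This is only an illustrative sketch, not the paper's implementation; the function names (`unproject_rgbd`, `fuse_views`) and the assumption of known intrinsics and camera-to-world poses for each view are ours.

```python
import numpy as np

def unproject_rgbd(rgb, depth, K, cam2world):
    """Lift one RGB-D view into world-space colored points (pinhole model).

    rgb: (H, W, 3), depth: (H, W) z-depth, K: (3, 3) intrinsics,
    cam2world: (4, 4) camera-to-world pose. All assumed known here.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                       # camera-frame ray directions
    pts_cam = rays * depth[..., None]                     # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((h, w, 1))], axis=-1)
    pts_world = (pts_h @ cam2world.T)[..., :3]            # transform to world frame
    return pts_world.reshape(-1, 3), rgb.reshape(-1, 3)

def fuse_views(rgbs, depths, Ks, cam2worlds):
    """Fuse all generated RGB-D views into one colored point cloud."""
    pts, cols = zip(*[unproject_rgbd(r, d, K, T)
                      for r, d, K, T in zip(rgbs, depths, Ks, cam2worlds)])
    return np.concatenate(pts), np.concatenate(cols)
```

A fused point cloud of this kind can then be meshed or converted to other representations; consistent depth across views is what keeps the per-view unprojections aligned in a single world frame.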
We compare our method against Zero-1-to-3 and SyncDreamer on the Objaverse test set. Our results are more consistent than those of Zero-1-to-3 and more realistic than those of SyncDreamer.
We compare our method with Zero-1-to-3 and SyncDreamer, visualizing the input image, the multi-view images generated by each method, and the ground truth.
We show full RGB-D results along with our predicted low-resolution depth maps.
If you find this work useful in your research, please consider citing the following.
@inproceedings{hu2024mvdfusion,
title={MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation},
author={Hanzhe Hu and Zhizhuo Zhou and Varun Jampani and Shubham Tulsiani},
booktitle={CVPR},
year={2024}
}
We thank Bharath Raj, Jason Y. Zhang, Yufei (Judy) Ye, Yanbo Xu, and Zifan Shi for helpful discussions and feedback. This work is supported in part by NSF GRFP grants DGE1745016 and DGE2140739.