Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
arXiv preprint, 2024
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models.
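As a rough illustration of the structured latent idea above, the following sketch pairs sparse active-voxel coordinates with per-voxel feature vectors and scatters them into a dense grid for a format-specific decoder. The class name, grid resolution, and channel count are illustrative assumptions, not the paper's implementation.

# Illustrative sketch (not the paper's code): a structured latent pairing sparse
# active-voxel coordinates with per-voxel features aggregated from multiview images.
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseLatent:
    coords: np.ndarray   # (N, 3) integer indices of active voxels on an R^3 grid
    feats: np.ndarray    # (N, C) latent features attached to those voxels
    resolution: int      # grid resolution R

    def to_dense(self) -> np.ndarray:
        """Scatter the sparse features into a dense (R, R, R, C) grid, e.g. as input
        to a format-specific decoder (radiance field, 3D Gaussians, mesh)."""
        dense = np.zeros((self.resolution,) * 3 + (self.feats.shape[1],),
                         dtype=self.feats.dtype)
        x, y, z = self.coords.T
        dense[x, y, z] = self.feats
        return dense

# Example: 1,000 active voxels on a 64^3 grid with 8-channel latents.
lat = SparseLatent(coords=np.random.randint(0, 64, size=(1000, 3)),
                   feats=np.random.randn(1000, 8).astype(np.float32),
                   resolution=64)
print(lat.to_dense().shape)  # (64, 64, 64, 8)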
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, Jiaolong Yang
arXiv preprint, 2024
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view.
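To make the affine-invariant point-map formulation concrete, the sketch below shows a plain least-squares alignment that recovers the unknown global scale and shift between a predicted point map and ground truth before comparing them. The paper's robust alignment solver is more elaborate; the function name and synthetic data here are purely illustrative.

# Illustrative least-squares alignment for an affine-invariant point map:
# solve a single scale s and translation t minimizing ||s * pred + t - gt||^2.
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (N, 3) point clouds; returns the optimal (s, t)."""
    p_mean, g_mean = pred.mean(axis=0), gt.mean(axis=0)
    p_c, g_c = pred - p_mean, gt - g_mean
    s = float((p_c * g_c).sum() / (p_c * p_c).sum())
    t = g_mean - s * p_mean
    return s, t

# Example: recover a synthetic global scale and shift exactly.
rng = np.random.default_rng(0)
gt = rng.normal(size=(500, 3))
pred = (gt - np.array([0.1, 0.2, 1.5])) / 2.0   # prediction up to unknown scale/shift
s, t = align_scale_shift(pred, gt)
print(np.allclose(s * pred + t, gt))            # True: s = 2.0, t = (0.1, 0.2, 1.5)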
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong
European Conference on Computer Vision (ECCV), 2024
We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images
Yuxuan Han, Ruicheng Wang, Jiaolong Yang
ACM SIGGRAPH Conference, 2022
This paper deals with the challenging task of synthesizing novel views for in-the-wild photographs. Existing methods have shown promising results leveraging monocular depth estimation and color inpainting with layered depth representations. However, these methods still have limited capability to handle scenes with complex 3D geometry. We propose a new method based on the multiplane image (MPI) representation. To accommodate diverse scene layouts in the wild and tackle the difficulty in producing high-dimensional MPI contents, we design a network structure that consists of two novel modules, one for plane depth adjustment and another for depth-aware color prediction. The former adjusts the initial plane positions using the RGBD context feature and an attention mechanism. Given adjusted depth values, the latter predicts the color and density for each plane separately, with proper inter-plane interactions achieved via a feature masking strategy. To train our method, we construct large-scale stereo training data using only unconstrained single-view image collections by a simple yet effective warp-back strategy.
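For context, a multiplane image is rendered by alpha-compositing the per-plane color and density (alpha) maps. The sketch below shows the standard back-to-front "over" compositing step; plane count and resolution are arbitrary, and this is not the paper's network or renderer.

# Illustrative MPI compositing: each plane carries per-pixel color and alpha;
# rendering alpha-composites the planes from back (index 0) to front.
import numpy as np

def composite_mpi(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """colors: (D, H, W, 3), alphas: (D, H, W, 1), back-to-front order."""
    out = np.zeros(colors.shape[1:], dtype=np.float32)
    for rgb, a in zip(colors, alphas):          # standard "over" operator
        out = rgb * a + out * (1.0 - a)
    return out

# Example: 32 planes at 64x64 resolution.
D, H, W = 32, 64, 64
colors = np.random.rand(D, H, W, 3).astype(np.float32)
alphas = np.random.rand(D, H, W, 1).astype(np.float32)
print(composite_mpi(colors, alphas).shape)      # (64, 64, 3)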
VirtualCube: An Immersive 3D Video Communication System
Yizhong Zhang*, Jiaolong Yang*, Zhen Liu, Ruicheng Wang, Guojun Chen, Xin Tong, Baining Guo.
IEEE Conference on Virtual Reality and 3D User Interfaces (VR2022) (& IEEE TVCG) (Best Journal Paper Award), 2021
The VirtualCube system is a 3D video conference system that attempts to overcome some limitations of conventional technologies. The key ingredient is VirtualCube, an abstract representation of a real-world cubicle instrumented with RGBD cameras for capturing the user's 3D geometry and texture. We design VirtualCube so that the task of data capturing is standardized and significantly simplified, and everything can be built using off-the-shelf hardware. We use VirtualCubes as the basic building blocks of a virtual conferencing environment, and we provide each VirtualCube user with a surrounding display showing life-size videos of remote participants. To achieve real-time rendering of remote participants, we develop the V-Cube View algorithm, which uses multi-view stereo for more accurate depth estimation and Lumi-Net rendering for better rendering quality. The VirtualCube system correctly preserves the mutual eye gaze between participants, allowing them to establish eye contact and be aware of who is visually paying attention to them. The system also allows a participant to have side discussions with remote participants as if they were in the same room. Finally, the system sheds light on how to support the shared space of work items (e.g., documents and applications) and track participants' visual attention to work items.
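For readers unfamiliar with multi-view stereo, the sketch below illustrates a basic plane-sweep depth estimator: hypothesize fronto-parallel depth planes, reproject the reference pixels into a second view at each depth, and keep the depth with the lowest photometric cost. It is a generic textbook illustration under simplified assumptions (two grayscale views, shared intrinsics), not the V-Cube View algorithm.

# Illustrative plane-sweep stereo sketch (not the V-Cube View implementation).
import numpy as np

def plane_sweep_depth(ref, src, K, R, t, depths):
    """ref, src: (H, W) grayscale images; K: (3, 3) shared intrinsics;
    R, t: rotation/translation from the reference to the source camera;
    depths: iterable of candidate depths. Returns an (H, W) depth map."""
    ref, src = ref.astype(np.float64), src.astype(np.float64)
    H, W = ref.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ pix
    best_cost = np.full(H * W, np.inf)
    best_depth = np.zeros(H * W)
    for d in depths:
        pts = rays * d                              # back-project onto the depth plane
        proj = K @ (R @ pts + t.reshape(3, 1))      # project into the source view
        u = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, W - 1)
        v = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, H - 1)
        cost = np.abs(ref.reshape(-1) - src[v, u])  # photometric matching cost
        better = cost < best_cost
        best_cost[better], best_depth[better] = cost[better], d
    return best_depth.reshape(H, W)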