GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

1University of Science and Technology of China, 2Zhongguancun Academy, 3Institute of Automation, Chinese Academy of Sciences, 4Eastern Institute of Technology, 5Zhejiang University, 6Beihang University
Pipeline comparison
Figure 1 Comparison of pipelines for 3D world generation. (a) Appearance-only generation. (b) Joint geometry+appearance diffusion. (c) Geometry-Then-Appearance (GTA): estimate geometry first, then synthesize appearance conditioned on geometry.
Teaser Our method follows a Geometry-Then-Appearance paradigm (GTA): it first infers a globally consistent geometric scaffold and then uses it to guide high-fidelity appearance synthesis, producing realistic and cross-view consistent 3D worlds from a single image.


Abstract

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, benefiting a broad range of domains including spatial intelligence, embodied intelligence, and world modeling. Despite this remarkable progress, existing approaches to 3D scene generation typically prioritize appearance prediction while only weakly modeling the underlying geometry, leading to unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, and motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method that follows a Geometry-Then-Appearance paradigm. Given a single input image, GTA adopts a two-stage framework with two dedicated video diffusion models: the first generates coarse geometric structure from novel viewpoints, and the second synthesizes fine-grained appearance conditioned on the predicted geometry, thereby improving the structural fidelity of the synthesized 3D scenes. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during training, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments demonstrate that our method consistently outperforms existing approaches in fidelity, visual quality, and geometric accuracy. Moreover, GTA serves as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, supports multiple downstream applications, and exhibits favorable data efficiency during training, highlighting its versatility and broad applicability.

Method

GTA pipeline
(a) Geometry-then-appearance video diffusion. Starting from a single input image, we first run geometry video diffusion to generate view-consistent multi-view depth, which is then used for depth-based warping into the novel views. Conditioned on the predicted geometry, the appearance video diffusion synthesizes a coherent novel-view RGB video with fine-grained textures.
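The two-stage inference can be summarized by the sketch below. This is a rough illustration under our reading of the figure (the predicted depth is used to warp the input into each target view as a partial condition); the model wrappers, the warp_image_with_depth helper, and all argument names are hypothetical, not the released API.

import torch

@torch.no_grad()
def generate_3d_world(image, cameras, geo_model, app_model, warp_image_with_depth):
    """image: (3, H, W) input view; cameras: intrinsics/extrinsics of the target views."""
    # Stage 1: geometry video diffusion predicts multi-view depth for the target views.
    depths = geo_model.sample(image=image, cameras=cameras)          # (V, H, W)

    # Depth-based warping projects the input pixels into each target view, yielding
    # partial RGB observations plus validity masks for occluded or unseen regions.
    warped_rgb, masks = warp_image_with_depth(image, depths, cameras)

    # Stage 2: appearance video diffusion fills in the missing content, conditioned on
    # the predicted geometry and the warped partial observations.
    rgb_video = app_model.sample(cond_depth=depths, cond_rgb=warped_rgb,
                                 cond_mask=masks, cameras=cameras)   # (V, 3, H, W)
    return rgb_video, depths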
Random latent shuffle
(b) Random latent shuffle. During training, latent representations of target views are randomly permuted with probability p. The same permutation is applied to both predicted and ground-truth latents, preserving supervision while discouraging view-order–dependent correlations.
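A minimal sketch of this strategy, assuming latents of shape (B, V, C, H, W) with V target views; the tensor layout and variable names are illustrative, not the paper's actual implementation.

import torch

def random_latent_shuffle(pred_latents, gt_latents, p=0.5):
    """With probability p, permute the view dimension of both predicted and
    ground-truth latents with the same permutation, preserving supervision
    while discouraging view-order-dependent correlations."""
    if torch.rand(()) < p:
        num_views = pred_latents.shape[1]
        perm = torch.randperm(num_views, device=pred_latents.device)
        pred_latents = pred_latents[:, perm]
        gt_latents = gt_latents[:, perm]
    return pred_latents, gt_latents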
Test-time scaling
(c) Test-time scaling. A subset of reliable views is progressively selected and reprojected to refine partial observations. Through masked warping, reliable regions are fixed while unreliable ones are suppressed, enabling stable coarse-to-fine synthesis with improved cross-view consistency.
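The refinement loop could look roughly like the sketch below; select_reliable_views, masked_warp, and the appearance-model interface are assumed helpers for illustration, not part of the released code.

def test_time_scaling(rgb, depth, cameras, app_model,
                      select_reliable_views, masked_warp, num_rounds=3):
    """rgb, depth: initial single-pass predictions for all target views."""
    for _ in range(num_rounds):
        # Progressively pick the views judged reliable, e.g. by reprojection consistency.
        reliable_idx = select_reliable_views(rgb, depth, cameras)

        # Masked warping: pixels reprojected from reliable views are kept fixed,
        # while unreliable regions are masked out and left to be regenerated.
        warped_rgb, mask = masked_warp(rgb, depth, cameras, reliable_idx)

        # Resynthesize appearance with the fixed reliable regions as conditions.
        rgb = app_model.sample(cond_depth=depth, cond_rgb=warped_rgb,
                               cond_mask=mask, cameras=cameras)
    return rgb, depth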

Experiments

• Quantitative comparisons with state-of-the-art methods.

Quantitative comparisons
Table Quantitative comparison across fidelity, perceptual quality, and geometric accuracy on DL3DV and RealEstate. ↑ / ↓ indicates higher / lower is better. Best, second-best, third-best are highlighted in red, orange, yellow. All results are reported using single-pass inference without test-time scaling.

• Qualitative comparisons with state-of-the-art methods.

Ablation Studies

• Geometry then appearance video diffusion.

• Random latent shuffle.

• Test-time scaling.

Post-hoc enhancement

Applications to downstream tasks.

• 3D scene editing.

• Video depth estimation.

Data efficiency.

Data efficiency
Figure Data efficiency comparison. Our method achieves comparable or even better performance than state-of-the-art approaches with an order of magnitude fewer training samples, and continues to improve steadily as the training data scale increases.

BibTeX

@article{zhu2026gta,
  author    = {Zhu, Hanxin and Wang, Cong and Tu, Peiyan and Luo, Jiayi and He, Tianyu and Jin, Xin and Chen, Zhibo},
  title     = {GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion},
  journal   = {arXiv},
  year      = {2026},
}