VGGT-Edit: Feed-forward 3D Scene Editing

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

Sophie Weber

May 14, 2026

Reporting by Kaixin Zhu, SwissFinanceAI Editorial Team

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
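
To make the residual idea concrete, here is a minimal sketch of how predicted 3D displacements might be applied to a reconstructed point map while leaving the background fixed. It is an illustration under stated assumptions, not the paper's implementation: the function name `apply_residual_field`, the flat `(N, 3)` point layout, and the per-point `edit_mask` are all hypothetical, and the abstract does not say how background stability is actually enforced.

```python
import numpy as np


def apply_residual_field(points, displacements, edit_mask):
    """Deform a point map by adding masked per-point 3D offsets.

    All names and shapes here are illustrative assumptions:
    points:        (N, 3) reconstructed 3D points
    displacements: (N, 3) predicted per-point offsets (the "residual")
    edit_mask:     (N,) weights in [0, 1]; 0 keeps a point unchanged
    """
    points = np.asarray(points, dtype=float)
    displacements = np.asarray(displacements, dtype=float)
    edit_mask = np.asarray(edit_mask, dtype=float)
    # Broadcast the mask over x, y, z so points with mask 0 stay untouched.
    return points + edit_mask[:, None] * displacements


# Toy scene: three points, the first of which is treated as background.
points = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 2.0]])
displacements = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.5]])
edit_mask = np.array([0.0, 1.0, 1.0])

edited = apply_residual_field(points, displacements, edit_mask)
```

Predicting an offset rather than a whole new scene means a zero residual reproduces the original geometry exactly, which is one plausible reason a residual formulation helps keep unedited regions stable.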

Source

Original Article: VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

Published: May 14, 2026

Author: Kaixin Zhu

References

[1] ArXiv AI Papers. "VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction." May 14, 2026. https://arxiv.org/abs/2605.15186v1

Transparency Notice: This article may contain AI-assisted content. All citations link to verified sources. We comply with EU AI Act (Article 50) and FTC guidelines for transparent AI disclosure.
