DVGT-2: Vision-Geometry-Action Model
for Autonomous Driving at Scale

1Tsinghua University   2Xiaomi EV   3University of Macau   4Peking University
*Equal Contributions  

Unlike existing driving models that rely on inverse perspective mapping or sparse perception results as driving interfaces, DVGT-2 reconstructs dense 3D geometry to provide a comprehensive and detailed scene representation.

Given unposed multi-view image sequences, DVGT-2 performs joint geometry reconstruction and trajectory planning in a fully online manner, enabling efficient and robust inference across infinite-length driving scenarios.

Operating on online input sequences, DVGT-2 infers the global geometry of the entire scene in a streaming fashion, demonstrating high fidelity and temporal consistency.

Overview

We propose DVGT-2, a streaming Vision-Geometry-Action (VGA) model for efficient, real-time autonomous driving. Given unposed multi-view image sequences, DVGT-2 jointly predicts dense 3D scene geometry and future trajectories in a fully online manner. By employing temporal causal attention and a sliding-window feature cache, our model enables on-the-fly inference with a constant O(1) computational cost, seamlessly supporting infinite-length, continuous driving.

Teaser

Method

DVGT-2 consists of three core components: an image encoder, a geometry transformer, and multi-task prediction heads. The model is trained to jointly perform dense geometry reconstruction and trajectory planning across full video sequences. By applying temporal causal attention, we ensure the model attends only to past frames while planning for the future at any given time step. This auto-regressive design naturally aligns the training process with streaming inference and significantly boosts training efficiency.
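The temporal causal attention described above can be illustrated with a minimal single-head sketch: each frame's tokens attend only to the current and past frames, never the future. This is an illustrative example in plain NumPy, not the authors' implementation; the function name and shapes are assumptions for clarity.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a temporal causal mask.

    q, k, v: arrays of shape (T, d), one token per time step. Masking the
    upper triangle of the score matrix ensures step t attends only to
    steps <= t, matching the streaming setting. (Illustrative sketch only.)
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (T, T) similarities
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(future, -np.inf, scores)          # block future frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v
```

Because the first time step can only attend to itself, its output is exactly its own value vector; this property makes the causal masking easy to verify.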

Architecture

During online inference, we maintain a sliding-window feature cache that stores historical intermediate representations within a fixed temporal window. This mechanism allows DVGT-2 to reuse previously computed features and avoid redundant computation entirely. As a result, the model achieves constant O(1) per-frame computational complexity, enabling consistent geometry reconstruction and robust trajectory planning across infinite-length driving scenarios.
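The caching idea above can be sketched with a fixed-size buffer: only the most recent frames' features are retained, so memory and per-step cost stay constant no matter how long the stream runs. This is a hypothetical sketch of the mechanism, not the authors' code; the class and names are illustrative.

```python
from collections import deque

class SlidingWindowCache:
    """Fixed-size cache of per-frame intermediate features.

    Keeps at most `window` recent frames; the oldest entry is evicted
    automatically on append, so both memory use and per-step cost are
    constant regardless of sequence length. (Hypothetical sketch.)
    """
    def __init__(self, window):
        self.buffer = deque(maxlen=window)

    def append(self, features):
        # O(1): deque with maxlen drops the oldest frame when full
        self.buffer.append(features)

    def context(self):
        # Features the current frame may attend to (oldest first)
        return list(self.buffer)

# Simulate an unbounded stream with a window of 3 frames
cache = SlidingWindowCache(window=3)
for t in range(10):
    cache.append(f"feat_{t}")
```

After ten frames, only the three most recent feature sets remain in the cache, which is what gives streaming inference its constant footprint.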

Architecture

Experiments

DVGT-2 achieves state-of-the-art performance in both 3D geometry reconstruction and trajectory planning across diverse datasets. Relying solely on annotation-efficient geometry supervision, it operates entirely without sparse perception labels or language annotations. By leveraging an efficient sliding-window streaming strategy, DVGT-2 enables real-time, infinite-length online inference with constant memory cost and a per-frame latency of just 266 ms, paving the way for next-generation autonomous driving.

Comparison
Experiments_1
Experiments_2

BibTeX

@article{zuo2026dvgt-2,
  title={DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale},
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Li, Hanbing and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2604.xxxxx},
  year={2026}
}