StreamVGGT: Streaming 4D Visual Geometry Transformer
Overview of our contributions. We propose StreamVGGT, a causal transformer architecture designed for efficient, real-time streaming 4D visual geometry reconstruction. Offline models must reprocess the entire image sequence and reconstruct the entire scene each time a new image arrives. StreamVGGT instead employs temporal causal attention and cached memory tokens to reconstruct incrementally and on the fly, enabling interactive, real-time online applications.
Our model consists of three main components: an image encoder, a spatio-temporal decoder, and multi-task prediction heads. During training, we utilize full-sequence inputs to provide the model with complete contextual information. To enforce temporal causality, we apply causal attention so the model can only attend to past frames at any given time step. This design encourages temporal modeling suitable for streaming inference.
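The causal constraint used during training can be illustrated with a frame-level attention mask. The following is a minimal NumPy sketch, not the released implementation: it assumes every frame contributes the same number of tokens, that a token may attend to all tokens of its own frame and of earlier frames, and that a single attention head is used. The names `frame_causal_mask` and `masked_attention` are hypothetical.

```python
import numpy as np

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask of shape (N, N), N = num_frames * tokens_per_frame.

    Entry [i, j] is True when query token i may attend to key token j,
    i.e. when token j belongs to the same frame as token i or to an earlier one.
    (Assumption: causality is enforced per frame, not per token.)
    """
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block attention to future frames
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                      # masked entries become exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With such a mask, the full sequence can be processed in one pass during training while each frame's output depends only on frames that would already be available at inference time.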
During streaming inference, we cache the historical keys and values as implicit memory to store information from past frames. This memory allows the model to efficiently reuse previously computed representations, avoiding redundant computation and enabling consistent contextual understanding across time.
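The caching scheme can be sketched for a single attention layer as follows. This is an illustrative NumPy example under stated assumptions, not the authors' code: it uses one head, random projection weights, and an unbounded cache, and the class `StreamingAttention` and its `step` method are hypothetical names.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class StreamingAttention:
    """Single-head attention layer that keeps past keys and values as memory."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random projections stand in for trained weights (assumption).
        self.wq, self.wk, self.wv = (
            rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
        )
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, frame_tokens: np.ndarray) -> np.ndarray:
        """Process one new frame of shape (tokens_per_frame, dim)."""
        q = frame_tokens @ self.wq
        # Only the new frame is projected; earlier keys and values are reused.
        self.k_cache = np.concatenate([self.k_cache, frame_tokens @ self.wk])
        self.v_cache = np.concatenate([self.v_cache, frame_tokens @ self.wv])
        scores = q @ self.k_cache.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.v_cache
```

Calling `step` once per incoming frame yields the same outputs as running frame-level causal attention over the whole sequence, but each call computes queries, keys, and values only for the new frame. In this sketch the cache grows linearly with the number of frames seen.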
@article{streamVGGT,
  title={Streaming 4D Visual Geometry Transformer},
  author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2507.11539},
  year={2025}
}