GaussianFormer: Scene as Gaussians for Vision-Based
3D Semantic Occupancy Prediction
Overview of our contributions. Motivated by the universal approximating ability of Gaussian mixtures, we propose an object-centric 3D semantic Gaussian representation that describes the fine-grained structure of 3D scenes without resorting to dense grids. We propose GaussianFormer, a model composed of sparse convolution and cross-attention that efficiently transforms 2D images into the 3D Gaussian representation. To generate dense 3D occupancy, we design a Gaussian-to-voxel splatting module that can be efficiently implemented with CUDA. With comparable performance, GaussianFormer reduces the memory consumption of existing 3D occupancy prediction methods by 75.2%–82.2%.
The voxel representation assigns a feature to each voxel in 3D space and is therefore redundant given the sparse nature of 3D scenes. BEV and TPV employ 2D planes to describe 3D space but can only alleviate this redundancy. In contrast, the proposed object-centric 3D Gaussian representation adapts to flexible regions of interest while still describing the fine-grained structure of the 3D scene, owing to the strong approximating ability of Gaussian mixtures.
Each Gaussian represents a flexible region of interest and consists of a mean, a covariance, and a semantic category.
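As a concrete illustration, the sketch below constructs a valid (positive semi-definite) covariance from a per-Gaussian scale vector and rotation quaternion, the parameterization popularized by 3D Gaussian splatting. The function names and the (w, x, y, z) quaternion ordering are our assumptions, not the authors' code.

```python
import torch

def quaternion_to_rotation(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternions (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    q = q / q.norm(dim=-1, keepdim=True)  # normalize to unit quaternions
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)

def covariance_from_scale_rotation(scale: torch.Tensor, quat: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T; scale (N, 3) is assumed positive (e.g. via softplus)."""
    R = quaternion_to_rotation(quat)   # (N, 3, 3)
    S = torch.diag_embed(scale)        # (N, 3, 3) diagonal scale matrix
    M = R @ S
    return M @ M.transpose(-1, -2)     # (N, 3, 3), PSD by construction
```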
We randomly initialize a set of queries to instantiate the 3D Gaussians and adopt a cross-attention mechanism to aggregate information from multi-scale image features. We iteratively refine the properties of the 3D Gaussians for smoother optimization. To efficiently incorporate interactions among the 3D Gaussians, we treat them as a point cloud located at the Gaussian means and process it with 3D sparse convolutions. We then decode the properties of the 3D semantic Gaussians from the updated queries as the scene representation (see the sketch below).
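A minimal sketch of one refinement iteration under simplifying assumptions: standard multi-head attention stands in for both the sparse-convolution self-encoding and the image cross-attention described above, and `GaussianRefineBlock` and the property layout are illustrative names rather than the authors' modules.

```python
import torch
import torch.nn as nn

class GaussianRefineBlock(nn.Module):
    """One iteration of Gaussian query refinement (illustrative simplification)."""
    def __init__(self, dim: int = 128, num_classes: int = 18, heads: int = 8):
        super().__init__()
        # Interactions among Gaussians (3D sparse convolution in the paper).
        self.self_interact = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Aggregation of multi-scale image features (cross-attention in the paper).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Residual update of the Gaussian properties:
        # 3 (mean) + 3 (scale) + 4 (rotation quaternion) + num_classes (semantic logits).
        self.refine_head = nn.Linear(dim, 3 + 3 + 4 + num_classes)

    def forward(self, queries, img_feats, props):
        # queries: (B, N, dim); img_feats: (B, M, dim); props: (B, N, 10 + num_classes)
        q, _ = self.self_interact(queries, queries, queries)
        q, _ = self.cross_attn(q, img_feats, img_feats)
        # Iterative refinement: predict residuals and add them to the current properties.
        return q, props + self.refine_head(q)
```

Stacking several such blocks and decoding the final properties from the updated queries yields the Gaussian scene representation.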
Motivated by 3D Gaussian splatting in image rendering, we design an efficient Gaussian-to-voxel splatting module that aggregates neighboring Gaussians to generate the semantic occupancy at a given 3D position.
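The following is a naive dense PyTorch reference for what such a splatting kernel computes: each voxel accumulates the semantic logits of the Gaussians, weighted by the Gaussian density at the voxel center. The function name and chunking scheme are ours; the actual CUDA kernel restricts each sum to neighboring Gaussians instead of iterating over all of them.

```python
import torch

def gaussian_to_voxel_splat(means, covs, logits, voxel_centers, chunk=65536):
    """Naive dense reference for Gaussian-to-voxel splatting.

    means:         (N, 3)    Gaussian means
    covs:          (N, 3, 3) Gaussian covariances
    logits:        (N, C)    per-Gaussian semantic logits
    voxel_centers: (V, 3)    query positions
    Returns (V, C) accumulated semantics.
    """
    inv_covs = torch.linalg.inv(covs)  # (N, 3, 3)
    out = torch.zeros(voxel_centers.shape[0], logits.shape[1], device=means.device)
    for start in range(0, voxel_centers.shape[0], chunk):
        x = voxel_centers[start:start + chunk]           # (v, 3)
        d = x[:, None, :] - means[None, :, :]            # (v, N, 3)
        # Mahalanobis distance d^T Sigma^{-1} d for every voxel-Gaussian pair.
        maha = torch.einsum('vni,nij,vnj->vn', d, inv_covs, d)
        w = torch.exp(-0.5 * maha)                       # (v, N) Gaussian weights
        out[start:start + chunk] = w @ logits            # weighted sum of semantics
    return out
```

Taking an argmax over the class dimension of the returned tensor gives the per-voxel semantic occupancy.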
The proposed 3D Gaussian representation uses a sparse and adaptive set of features to describe the 3D scene, yet can still model its fine-grained structure thanks to the universal approximating ability of Gaussian mixtures.
@article{huang2024gaussian,
    title={GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction},
    author={Huang, Yuanhui and Zheng, Wenzhao and Zhang, Yunpeng and Zhou, Jie and Lu, Jiwen},
    journal={arXiv preprint arXiv:2405.17429},
    year={2024}
}