Wenzhao Zheng
I am currently a postdoctoral fellow in the Department of EECS at the University of California, Berkeley, affiliated with the Berkeley Artificial Intelligence Research Lab (BAIR) and Berkeley DeepDrive (BDD), supervised by Prof. Kurt Keutzer.
Prior to that, I received my Ph.D. degree from the Department of Automation at Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu.
In 2018, I received my BS degree from the Department of Physics, Tsinghua University.
I am generally interested in artificial intelligence and deep learning. My current research focuses on:
🦙 Large Models + 🚙 Embodied Agents -> 🤖 AGI
🦙 Large Models and World Models: Efficient/Small LLMs, Multimodal Models, Video Generation Models, Large Action Models...
🚙 Embodied Agents and Spatial Intelligence: 3D Occupancy Prediction, End-to-End Driving, 3D Scene Reconstruction, 4D Scene Simulation...
If you want to work with me (in person or remotely) as an intern at BAIR, feel free to drop me an email at wzzheng@berkeley.edu.
Email  / 
CV  / 
Google Scholar  / 
GitHub
|
|
News
2025-01: Two papers on indoor and outdoor 3D reconstruction are accepted to ICRA 2025.
2025-01: One paper on autonomous driving is accepted to ICLR 2025.
2025-01: One paper on human mesh recovery is accepted to T-MM.
2024-07: Four papers are accepted to ECCV 2024.
2024-05: One paper on lane detection is accepted to T-IP.
2024-04: One paper on 3D object detection is accepted to T-MM.
2024-02: Two papers on 3D occupancy prediction are accepted to CVPR 2024.
2024-01: One paper on explainable deep learning is accepted to ICLR 2024.
|
*Equal contribution †Project leader/Corresponding author.
|
|
GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting
Jiajun Dong*,
Chengkun Wang*,
Wenzhao Zheng†,
Lei Chen,
Jiwen Lu,
Yansong Tang
arXiv, 2025.
[arXiv]
[Code]
[中文解读 (in Chinese)]
GaussianToken represents each image by a set of 2D Gaussian tokens in a feed-forward manner.
|
|
UniDrive: Towards Universal Driving Perception Across Camera Configurations
Ye Li,
Wenzhao Zheng†,
Xiaonan Huang,
Kurt Keutzer
International Conference on Learning Representations (ICLR), 2025.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
UniDrive presents the first comprehensive framework designed to generalize vision-centric 3D perception models across diverse camera configurations.
|
|
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
Yuanhui Huang,
Amonnut Thammatadatrakoon,
Wenzhao Zheng†,
Yunpeng Zhang,
Dalong Du,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
GaussianFormer-2 interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry.
|
|
GlobalMamba: Global Image Serialization for Vision Mamba
Chengkun Wang*,
Wenzhao Zheng*†,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
GlobalMamba constructs a causal token sequence ordered by frequency while ensuring that each token acquires global feature information.
|
|
V2M: Visual 2-Dimensional Mamba for Image Representation Learning
Chengkun Wang*,
Wenzhao Zheng*†,
Yuanhui Huang,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
Visual 2-Dimensional Mamba (V2M) generalizes SSMs to the 2-dimensional space and generates the next state from the two adjacent states along both dimensions (e.g., columns and rows), directly processing image tokens in the 2D space.
|
World Models
|
Owl-1: Omni World Model for Consistent Long Video Generation
Yuanhui Huang,
Wenzhao Zheng†,
Yuan Gao,
Xin Tao,
Pengfei Wan,
Di Zhang,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[中文解读 (in Chinese)]
Owl-1 approaches consistent long video generation with an omni world model, which models the evolution of the underlying world with latent state, explicit observation and world dynamics variables.
|
|
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
Wenzhao Zheng*†,
Weiliang Chen*,
Yuanhui Huang,
Borui Zhang,
Yueqi Duan,
Jiwen Lu
European Conference on Computer Vision (ECCV), 2024.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
OccWorld models the joint evolutions of 3D scenes and ego movements and paves the way for interpretable end-to-end large driving models.
|
Efficient Large Models
|
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang*,
Chun-Kai Fan*,
Junpeng Ma*,
Wenzhao Zheng†,
Tao Huang,
Kuan Cheng,
Denis Gudovskiy,
Tomoyuki Okuno,
Yohei Nakata,
Kurt Keutzer,
Shanghang Zhang
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
SparseVLM sparsifies visual tokens adaptively based on the question prompt.
|
Large Driving Models [Page]
|
Doe-1: Closed-Loop Autonomous Driving with Large World Model
Wenzhao Zheng*†,
Zetian Xia*,
Yuanhui Huang,
Sicheng Zuo,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
Doe-1 is the first closed-loop autonomous driving model for unified perception, prediction, and planning.
|
|
GPD-1: Generative Pre-training for Driving
Zetian Xia*,
Sicheng Zuo*,
Wenzhao Zheng*†,
Yunpeng Zhang,
Dalong Du,
Jie Zhou,
Jiwen Lu,
Shanghang Zhang
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
GPD-1 proposes a unified approach that seamlessly accomplishes multiple aspects of scene evolution, including scene simulation, traffic simulation, closed-loop simulation, map prediction, and motion planning, all without additional fine-tuning.
|
|
Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model
Lening Wang*,
Wenzhao Zheng*†,
Dalong Du,
Yunpeng Zhang,
Yilong Ren,
Han Jiang,
Zhiyong Cui,
Haiyang Yu,
Jie Zhou,
Jiwen Lu,
Shanghang Zhang
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
Spatial-Temporal simulAtion for drivinG (Stag-1) enables controllable 4D autonomous driving simulation with spatial-temporal decoupling.
|
|
Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving
Xin Fei,
Wenzhao Zheng†,
Yueqi Duan,
Wei Zhan,
Masayoshi Tomizuka,
Kurt Keutzer,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
Driv3R predicts per-frame pointmaps in a globally consistent coordinate system in an optimization-free manner.
|
End-to-End Autonomous Driving
|
GaussianAD: Gaussian-Centric End-to-End Autonomous Driving
Wenzhao Zheng*†,
Junjie Wu*,
Yao Zheng*,
Sicheng Zuo,
Zixun Xie,
Longchao Yang,
Yong Pan,
Zhihui Hao,
Peng Jia,
Xianpeng Lang,
Shanghang Zhang
arXiv, 2024.
[arXiv]
[Code]
[中文解读 (in Chinese)]
GaussianAD is a Gaussian-centric end-to-end framework that employs sparse yet comprehensive 3D Gaussians to pass information through the pipeline and efficiently preserve more details.
|
|
GenAD: Generative End-to-End Autonomous Driving
Wenzhao Zheng*,
Ruiqi Song*,
Xianda Guo*,
Chenming Zhang,
Long Chen
European Conference on Computer Vision (ECCV), 2024.
[arXiv]
[Code]
[中文解读 (in Chinese)]
GenAD casts end-to-end autonomous driving as a generative modeling problem.
|
3D Occupancy Prediction
|
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Sicheng Zuo*,
Wenzhao Zheng*†,
Yuanhui Huang,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
GaussianWorld reformulates 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input and proposes a Gaussian World Model to exploit the scene evolution for perception.
|
|
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
Yuanhui Huang,
Wenzhao Zheng†,
Yunpeng Zhang,
Jie Zhou,
Jiwen Lu
European Conference on Computer Vision (ECCV), 2024.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
GaussianFormer proposes 3D semantic Gaussians as a more efficient object-centric representation for driving scenes compared with dense 3D occupancy.
|
|
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
Yuanhui Huang*,
Wenzhao Zheng*†,
Borui Zhang,
Jie Zhou,
Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
SelfOcc is the first self-supervised work that produces reasonable 3D occupancy for surround cameras.
|
|
Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Yuanhui Huang*,
Wenzhao Zheng*†,
Yunpeng Zhang,
Jie Zhou,
Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[arXiv]
[Code]
[Project Page]
[中文解读 (in Chinese)]
Given only surround-camera RGB images as inputs, our model (trained using only sparse LiDAR point supervision) can predict the semantic occupancy for all volumes in the 3D space.
|
3D Scene Reconstruction
|
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu*,
Wenzhao Zheng*†,
Sicheng Zuo,
Yuanhui Huang,
Jie Zhou,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
EmbodiedOcc formulates an embodied 3D occupancy prediction task and employs a Gaussian-based framework to accomplish it.
|
|
PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary Views
Xin Fei,
Wenzhao Zheng†,
Yueqi Duan,
Wei Zhan,
Masayoshi Tomizuka,
Kurt Keutzer,
Jiwen Lu
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
PixelGaussian dynamically adjusts the Gaussian distributions based on geometric complexity in a feed-forward framework.
|
|
S3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving
Nan Huang,
Xiaobao Wei,
Wenzhao Zheng†,
Pengju An,
Ming Lu,
Wei Zhan,
Masayoshi Tomizuka,
Kurt Keutzer,
Shanghang Zhang
arXiv, 2024.
[arXiv]
[Code]
[Project Page]
S3Gaussian employs 3D Gaussians to model dynamic scenes for autonomous driving without additional supervision (e.g., 3D bounding boxes).
|
Deep Metric Learning (My Ph.D. Research Topic)
|
Introspective Deep Metric Learning
Chengkun Wang*,
Wenzhao Zheng*†,
Zheng Zhu,
Jie Zhou,
Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2024.
[arXiv]
[Code]
We propose an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images.
|
|
Deep Metric Learning with Adaptively Composite Dynamic Constraints
Wenzhao Zheng,
Jiwen Lu,
Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2023.
[PDF]
This paper formulates deep metric learning under a unified framework and proposes a dynamic constraint generator to produce adaptive composite constraints that train the metric towards good generalization.
|
|
Hardness-Aware Deep Metric Learning
Wenzhao Zheng,
Zhaodong Chen,
Jiwen Lu,
Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019 (oral).
Wenzhao Zheng,
Jiwen Lu,
Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2021.
[PDF]
[PDF] (Journal version)
[Code]
We perform linear interpolation on embeddings to adaptively manipulate their hardness levels and generate corresponding label-preserving synthetics for recycled training.
|
|
Deep Adversarial Metric Learning
Yueqi Duan,
Wenzhao Zheng,
Xudong Lin,
Jiwen Lu,
Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (spotlight).
Yueqi Duan,
Jiwen Lu,
Wenzhao Zheng,
Jie Zhou
IEEE Transactions on Image Processing (T-IP, IF: 11.041), 2020.
[PDF]
[PDF] (Journal version)
[Code]
We generate potential hard negatives adversarial to the learned metric as complements.
|
|
Structural Deep Metric Learning for Room Layout Estimation
Wenzhao Zheng,
Jiwen Lu,
Jie Zhou
European Conference on Computer Vision (ECCV), 2020.
[PDF]
We are the first to apply deep metric learning to prediction tasks with structured labels.
|
Visual Representation Learning
|
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
Han Xiao*,
Wenzhao Zheng*,
Sicheng Zuo,
Peng Gao,
Jie Zhou,
Jiwen Lu
European Conference on Computer Vision (ECCV), 2024.
[PDF]
SpatialFormer introduces explicit spatial understanding into vision transformers to improve their generalizability.
|
|
OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions
Chengkun Wang*,
Wenzhao Zheng*,
Zheng Zhu,
Jie Zhou,
Jiwen Lu
IEEE International Conference on Computer Vision (ICCV), 2023.
[arXiv]
[Code]
We unify fully supervised and self-supervised contrastive learning and exploit both supervisions from labeled and unlabeled data for training.
|
|
Token-Label Alignment for Vision Transformers
Han Xiao*,
Wenzhao Zheng*,
Zheng Zhu,
Jie Zhou,
Jiwen Lu
IEEE International Conference on Computer Vision (ICCV), 2023.
[arXiv]
[Code]
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies for vision transformers. To address this, we propose a token-label alignment (TL-Align) method that traces the correspondence between the transformed tokens and the original tokens to maintain a label for each token.
|
Explainable Artificial Intelligence
|
Path Choice Matters for Clear Attribution in Path Methods
Borui Zhang,
Wenzhao Zheng,
Jie Zhou,
Jiwen Lu
International Conference on Learning Representations (ICLR), 2024.
[arXiv]
[Code]
To address the ambiguity in attributions caused by different path choices, we introduce the Concentration Principle and develop SAMP, an efficient model-agnostic interpreter. By incorporating the infinitesimal constraint (IC) and momentum strategy (MS), SAMP provides superior interpretations.
|
|
Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint
Borui Zhang,
Wenzhao Zheng,
Jie Zhou,
Jiwen Lu
International Conference on Learning Representations (ICLR), 2023.
[arXiv]
[Code]
This paper proposes Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency.
|
|
Attributable Visual Similarity Learning
Borui Zhang,
Wenzhao Zheng,
Jie Zhou,
Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[arXiv]
[Code]
This paper proposes an attributable visual similarity learning (AVSL) framework, which employs a generalized similarity learning paradigm to represent the similarity between two images with a graph, yielding a more accurate and explainable similarity measure.
|
Honors and Awards
2024 Excellent Doctoral Dissertation of Chinese Association for Artificial Intelligence
2023 Tsinghua Excellent Doctoral Dissertation Award
2023 Beijing Outstanding Graduate
2023 Tsinghua Outstanding Graduate
2022 Xuancheng Scholarship
2021 National Scholarship (highest scholarship given by the government of China)
2021 CVPR Outstanding Reviewer
2020 Changtong Scholarship (highest scholarship in the Dept. of Automation)
2019 National Scholarship (highest scholarship given by the government of China)
2017 Tung OOCL Scholarship
2016 German Scholarship
|
Academic Services
Conference Reviewer / PC Member: CVPR 2019-2025, ICCV 2019-2023, ECCV 2020-2024, NeurIPS 2023-2024, ICLR 2024-2025, ICML 2025, IJCAI 2020-2024, WACV 2020-2024, ICME 2019-2024
Senior PC Member: IJCAI 2021
Journal Reviewer: T-PAMI, T-IP, T-MM, T-CSVT, T-NNLS, T-BIOM, T-IST, Pattern Recognition, Pattern Recognition Letters
|
|