Wenzhao Zheng

I am currently a postdoctoral fellow in the Department of EECS at the University of California, Berkeley, affiliated with the Berkeley Artificial Intelligence Research Lab (BAIR) and Berkeley Deep Drive (BDD), supervised by Prof. Kurt Keutzer. Prior to that, I received my Ph.D. degree from the Department of Automation at Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu. In 2018, I received my B.S. degree from the Department of Physics, Tsinghua University.

I am generally interested in artificial intelligence and deep learning. My current research focuses on:

🦙 Foundation Models + 🚙 Physical Intelligence -> 🤖 AGI

🦙 Foundation Models: Large Models, Visual Generation, AI Safety...

🚙 Physical Intelligence: Spatial Understanding, Autonomous Driving, Embodied Robots...

If you want to work with me (in person or remotely) as an intern at BAIR, feel free to drop me an email at wzzheng@berkeley.edu.

Email / CV / Google Scholar / GitHub

News

2026-01: 4 papers are accepted to ICLR 2026: StreamVGGT, SVG, Astra, and S2GO.

2025-09: 3 papers are accepted to NeurIPS 2025: Point3R, QuadricFormer, and DrivingRecon.

2025-06: 6 papers are accepted to ICCV 2025: EmbodiedOcc, Stag-1, SpectralAR, PlaneRAS, D³QE, and CDAL.

2025-05: 1 paper is accepted to ICML 2025: SparseVLM.

2025-02: 4 papers are accepted to CVPR 2025: SegAnyMo, GaussianWorld, GaussianFormer-2, and DeSiRe-GS.

2025-01: 1 paper is accepted to ICLR 2025: UniDrive.

2024-07: 4 papers are accepted to ECCV 2024: OccWorld, GenAD, GaussianFormer, and SpatialFormer.

2024-02: 2 papers are accepted to CVPR 2024: SelfOcc and LowRankOcc.

2024-01: 1 paper is accepted to ICLR 2024: SAMP.

*Equal contribution ^†Project leader/Corresponding author.

Selected Papers [Full List]

Large Models: Native Multimodal Models, Efficient LLMs, Agents...

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang*, Chun-Kai Fan*, Junpeng Ma*, Wenzhao Zheng^†, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
International Conference on Machine Learning (ICML), 2025.
[arXiv] [Code] [Project Page]

SparseVLM sparsifies visual tokens adaptively based on the question prompt.

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Jerry Jiang*, Haowen Sun*, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng^†
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

Proxy3D enables existing vision-language models to process 3D representations for better spatial understanding.

Visual Generation: Generative Frameworks, World Models, Digital Humans...

SVG: Latent Diffusion Model without Variational Autoencoder
Minglei Shi*, Haolin Wang*, Wenzhao Zheng^†, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
International Conference on Learning Representations (ICLR), 2026.
[arXiv] [Code] [Model] [Project Page] [中文解读 (in Chinese)]

SVG is a latent diffusion model without variational autoencoders to unleash Self-supervised representations for Visual Generation.

Astra: General Interactive World Model with Autoregressive Denoising
Yixuan Zhu*, Jiaqi Feng*, Wenzhao Zheng^†, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
International Conference on Learning Representations (ICLR), 2026.
[arXiv] [Code] [Model] [Project Page]

Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.

AI Safety: Visual Content Forensics, Transparent Models...

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Yifei Li, Wenzhao Zheng^†, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[arXiv] [Code] [Project Page] [Model]

Skyra focuses on Grounded Artifact Reasoning to simultaneously perform Artifact Perception, Spatio-Temporal Grounding, and Explanatory Reasoning.

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
Yanran Zhang, Wenzhao Zheng^†, Yifei Li, Bingyao Yu, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[Project Page]

UniGenDet bridges generation and detection in a unified, co-evolutionary framework.

Spatial Understanding: 4D Reconstruction, Native 4D Generation...

Streaming 4D Visual Geometry Transformer
Dong Zhuo, Wenzhao Zheng^†*, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
International Conference on Learning Representations (ICLR), 2026.
[arXiv] [Code] [Project Page]

StreamVGGT employs temporal causal attention and leverages cached token memory to support efficient incremental on-the-fly reconstruction, enabling interactive and real-time online applications.

Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Yuqi Wu, Wenzhao Zheng^†*, Jie Zhou, Jiwen Lu
Conference on Neural Information Processing Systems (NeurIPS), 2025.
[arXiv] [Code] [Project Page]

Point3R is an online framework for dense streaming 3D reconstruction using explicit spatial memory, which achieves competitive performance with low training costs.

Segment Any Motion in Videos
Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[arXiv] [Code] [Project Page]

Our model produces instance-level fine-grained moving object masks and can handle challenging scenarios including articulated structures, shadow reflections, dynamic background motion, and drastic camera movement.

Autonomous Driving [Page]: 3D Occupancy Prediction, End-to-End Driving, 4D Driving Simulation...

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Yuanhui Huang*, Wenzhao Zheng^†*, Yunpeng Zhang, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

Given only surround-camera RGB images as inputs, our model (trained using only sparse LiDAR point supervision) can predict the semantic occupancy for all volumes in the 3D space.

DVGT: Driving Visual Geometry Transformer
Sicheng Zuo, Zixun Xie, Wenzhao Zheng^†*, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

DVGT is a universal visual geometry transformer for autonomous driving, which directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images.

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
Wenzhao Zheng^†*, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu
European Conference on Computer Vision (ECCV), 2024.
[arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

OccWorld models the joint evolutions of 3D scenes and ego movements and paves the way for interpretable end-to-end large driving models.

GenAD: Generative End-to-End Autonomous Driving
Wenzhao Zheng, Ruiqi Song, Xianda Guo*, Chenming Zhang, Long Chen
European Conference on Computer Vision (ECCV), 2024.
[arXiv] [Code] [中文解读 (in Chinese)]

GenAD casts end-to-end autonomous driving as a generative modeling problem.

Embodied Robots: Embodied Perception, Navigation, Manipulation, Simulation...

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu, Wenzhao Zheng^†*, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
IEEE International Conference on Computer Vision (ICCV), 2025.
[arXiv] [Code] [Project Page]

EmbodiedOcc formulates an embodied 3D occupancy prediction task and employs a Gaussian-based framework to accomplish it.

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Chaojun Ni, Cheng Chen, Xiaofeng Wang*, Zheng Zhu*, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[arXiv] [Code] [Project Page]

SwiftVLA integrates 4D spatiotemporal information into a lightweight vision-language-action model at minimal costs.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

AwareVLN equips a VLN agent with self-aware and structured reasoning that is adaptively triggered at key navigation points.

My Ph.D. Research Topic

Deep Metric Learning

Introspective Deep Metric Learning
Chengkun Wang*, Wenzhao Zheng^†*, Zheng Zhu, Jie Zhou, Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2024.
[arXiv] [Code]

We propose an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images.

Deep Metric Learning with Adaptively Composite Dynamic Constraints
Wenzhao Zheng, Jiwen Lu, Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2023.
[PDF]

This paper formulates deep metric learning under a unified framework and proposes a dynamic constraint generator to produce adaptive composite constraints to train the metric towards good generalization.

Hardness-Aware Deep Metric Learning
Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019 (oral).
Wenzhao Zheng, Jiwen Lu, Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2021.
[PDF] [PDF] (Journal version) [Code]

We perform linear interpolation on embeddings to adaptively manipulate their hardness levels and generate corresponding label-preserving synthetics for recycled training.

Deep Adversarial Metric Learning
Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (spotlight).
Yueqi Duan, Jiwen Lu, Wenzhao Zheng, Jie Zhou
IEEE Transactions on Image Processing (T-IP, IF: 11.041), 2020.
[PDF] [PDF] (Journal version) [Code]

We generate potential hard negatives adversarial to the learned metric as complements.

Structural Deep Metric Learning for Room Layout Estimation
Wenzhao Zheng, Jiwen Lu, Jie Zhou
European Conference on Computer Vision (ECCV), 2020.
[PDF]

We are the first to apply deep metric learning to prediction tasks with structured labels.

Honors and Awards

2025 CAAI Wu Wenjun Artificial Intelligence Natural Science Award (First Class)

2024 Excellent Doctoral Dissertation of Chinese Association for Artificial Intelligence (CAAI)

2023 Tsinghua Excellent Doctoral Dissertation Award

2023 Beijing Outstanding Graduate

2023 Tsinghua Outstanding Graduate

2022 Xuancheng Scholarship

2021 National Scholarship (highest scholarship given by the government of China)

2021 CVPR Outstanding Reviewer

2020 Changtong Scholarship (highest scholarship in the Dept. of Automation)

2019 National Scholarship (highest scholarship given by the government of China)

2017 Tung OOCL Scholarship

2016 German Scholarship

Academic Services

Conference Reviewer / PC Member: CVPR 2019-2026, ICCV 2019-2025, ECCV 2020-2026, NeurIPS 2023-2025, ICLR 2024-2026, ICML 2025-2026, IJCAI 2020-2025, WACV 2020-2025, ICME 2019-2025

Senior PC Member: IJCAI 2021

Journal Reviewer: T-PAMI, T-IP, T-MM, T-CSVT, T-NNLS, T-BIOM, T-IST, Pattern Recognition, Pattern Recognition Letters

Website Template

© Wenzhao Zheng | Last updated: Apr. 1st, 2026.