Wenzhao Zheng

I am currently a postdoctoral fellow in the Department of EECS at the University of California, Berkeley, affiliated with the Berkeley Artificial Intelligence Research Lab (BAIR) and Berkeley Deep Drive (BDD), supervised by Prof. Kurt Keutzer. Prior to that, I received my Ph.D. degree from the Department of Automation at Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu. In 2018, I received my BS degree from the Department of Physics, Tsinghua University.

I am generally interested in artificial intelligence and deep learning. My current research focuses on:

🦙 Large Models + 🚙 Embodied Agents -> 🤖 AGI

  • 🦙 Large Models and World Models: Efficient/Small LLMs, Multimodal Models, Video Generation Models, Large Action Models...
  • 🚙 Embodied Agents and Spatial Intelligence: 3D Occupancy Prediction, End-to-End Driving, 3D Scene Reconstruction, 4D Scene Simulation...
  • If you want to work with me (in person or remotely) as an intern at BAIR, feel free to drop me an email at wzzheng@berkeley.edu.

    Email  /  CV  /  Google Scholar  /  GitHub

    News

  • 2025-01: Two papers on indoor and outdoor 3D reconstruction are accepted to ICRA 2025.
  • 2025-01: One paper on autonomous driving is accepted to ICLR 2025.
  • 2025-01: One paper on human mesh recovery is accepted to T-MM.
  • 2024-07: Four papers are accepted to ECCV 2024.
  • 2024-05: One paper on lane detection is accepted to T-IP.
  • 2024-04: One paper on 3D object detection is accepted to T-MM.
  • 2024-02: Two papers on 3D occupancy prediction are accepted to CVPR 2024.
  • 2024-01: One paper on explainable deep learning is accepted to ICLR 2024.
    *Equal contribution    †Project leader/Corresponding author.

    Newest Papers

    GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting
    Jiajun Dong*, Chengkun Wang*, Wenzhao Zheng†, Lei Chen, Jiwen Lu, Yansong Tang
    arXiv, 2025.
    [arXiv] [Code] [Explainer (in Chinese)]

    GaussianToken represents each image by a set of 2D Gaussian tokens in a feed-forward manner.

    UniDrive: Towards Universal Driving Perception Across Camera Configurations
    Ye Li, Wenzhao Zheng†, Xiaonan Huang, Kurt Keutzer
    International Conference on Learning Representations (ICLR), 2025.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    UniDrive presents the first comprehensive framework designed to generalize vision-centric 3D perception models across diverse camera configurations.

    GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
    Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng†, Yunpeng Zhang, Dalong Du, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    GaussianFormer-2 interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry.

    GlobalMamba: Global Image Serialization for Vision Mamba
    Chengkun Wang*, Wenzhao Zheng*†, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code]

    GlobalMamba constructs a causal token sequence by frequency, while ensuring that tokens acquire global feature information.

    V2M: Visual 2-Dimensional Mamba for Image Representation Learning
    Chengkun Wang*, Wenzhao Zheng*†, Yuanhui Huang, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code]

    Visual 2-Dimensional Mamba (V2M) generalizes SSM to the 2-dimensional space, generating the next state from two adjacent states on both dimensions (e.g., columns and rows), and thus directly processes image tokens in the 2D space.

    Selected Papers [Full List]

    World Models

    Owl-1: Omni World Model for Consistent Long Video Generation
    Yuanhui Huang, Wenzhao Zheng†, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Explainer (in Chinese)]

    Owl-1 approaches consistent long video generation with an omni world model, which models the evolution of the underlying world with latent state, explicit observation and world dynamics variables.

    OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
    Wenzhao Zheng*†, Weiliang Chen*, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu
    European Conference on Computer Vision (ECCV), 2024.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    OccWorld models the joint evolutions of 3D scenes and ego movements and paves the way for interpretable end-to-end large driving models.

    Efficient Large Models

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
    Yuan Zhang*, Chun-Kai Fan*, Junpeng Ma*, Wenzhao Zheng†, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    SparseVLM sparsifies visual tokens adaptively based on the question prompt.

    Large Driving Models [Page]

    Doe-1: Closed-Loop Autonomous Driving with Large World Model
    Wenzhao Zheng*†, Zetian Xia*, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    Doe-1 is the first closed-loop autonomous driving model for unified perception, prediction, and planning.

    GPD-1: Generative Pre-training for Driving
    Zetian Xia*, Sicheng Zuo*, Wenzhao Zheng*†, Yunpeng Zhang, Dalong Du, Jie Zhou, Jiwen Lu, Shanghang Zhang
    arXiv, 2024.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    GPD-1 proposes a unified approach that seamlessly accomplishes multiple aspects of scene evolution, including scene simulation, traffic simulation, closed-loop simulation, map prediction, and motion planning, all without additional fine-tuning.

    Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model
    Lening Wang*, Wenzhao Zheng*†, Dalong Du, Yunpeng Zhang, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jie Zhou, Jiwen Lu, Shanghang Zhang
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    Spatial-Temporal simulAtion for drivinG (Stag-1) enables controllable 4D autonomous driving simulation with spatial-temporal decoupling.

    Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving
    Xin Fei, Wenzhao Zheng†, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    Driv3R predicts per-frame pointmaps in a globally consistent coordinate system in an optimization-free manner.

    End-to-End Autonomous Driving

    GaussianAD: Gaussian-Centric End-to-End Autonomous Driving
    Wenzhao Zheng*†, Junjie Wu*, Yao Zheng*, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xianpeng Lang, Shanghang Zhang
    arXiv, 2024.
    [arXiv] [Code] [Explainer (in Chinese)]

    GaussianAD is a Gaussian-centric end-to-end framework which employs sparse yet comprehensive 3D Gaussians to pass information through the pipeline to efficiently preserve more details.

    GenAD: Generative End-to-End Autonomous Driving
    Wenzhao Zheng*, Ruiqi Song*, Xianda Guo*, Chenming Zhang, Long Chen
    European Conference on Computer Vision (ECCV), 2024.
    [arXiv] [Code] [Explainer (in Chinese)]

    GenAD casts end-to-end autonomous driving as a generative modeling problem.

    3D Occupancy Prediction

    GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
    Sicheng Zuo*, Wenzhao Zheng*†, Yuanhui Huang, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code]

    GaussianWorld reformulates 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input and proposes a Gaussian World Model to exploit the scene evolution for perception.

    GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
    Yuanhui Huang, Wenzhao Zheng†, Yunpeng Zhang, Jie Zhou, Jiwen Lu
    European Conference on Computer Vision (ECCV), 2024.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    GaussianFormer proposes the 3D semantic Gaussians as a more efficient object-centric representation for driving scenes compared with 3D occupancy.

    SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
    Yuanhui Huang*, Wenzhao Zheng*†, Borui Zhang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    SelfOcc is the first self-supervised work that produces reasonable 3D occupancy for surround cameras.

    Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
    Yuanhui Huang*, Wenzhao Zheng*†, Yunpeng Zhang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
    [arXiv] [Code] [Project Page] [Explainer (in Chinese)]

    Given only surround-camera RGB images as inputs, our model (trained using only sparse LiDAR point supervision) can predict the semantic occupancy for all volumes in the 3D space.

    3D Scene Reconstruction

    EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
    Yuqi Wu*, Wenzhao Zheng*†, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    EmbodiedOcc formulates an embodied 3D occupancy prediction task and employs a Gaussian-based framework to accomplish it.

    PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary Views
    Xin Fei, Wenzhao Zheng†, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    PixelGaussian dynamically adjusts the Gaussian distributions based on geometric complexity in a feed-forward framework.

    S3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving
    Nan Huang, Xiaobao Wei, Wenzhao Zheng†, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang
    arXiv, 2024.
    [arXiv] [Code] [Project Page]

    S3Gaussian employs 3D Gaussians to model dynamic scenes for autonomous driving without additional supervision (e.g., 3D bounding boxes).

    Other Topics

    Deep Metric Learning (My Ph.D. Research Topic)

    Introspective Deep Metric Learning
    Chengkun Wang*, Wenzhao Zheng*†, Zheng Zhu, Jie Zhou, Jiwen Lu
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2024.
    [arXiv] [Code]

    We propose an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images.

    Deep Metric Learning with Adaptively Composite Dynamic Constraints
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2023.
    [PDF]

    This paper formulates deep metric learning under a unified framework and proposes a dynamic constraint generator to produce adaptive composite constraints to train the metric towards good generalization.

    Hardness-Aware Deep Metric Learning
    Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019 (oral).
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2021.
    [PDF] [PDF] (Journal version) [Code]

    We perform linear interpolation on embeddings to adaptively manipulate their hardness levels and generate corresponding label-preserving synthetics for recycled training.

    Deep Adversarial Metric Learning
    Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (spotlight).
    Yueqi Duan, Jiwen Lu, Wenzhao Zheng, Jie Zhou
    IEEE Transactions on Image Processing (T-IP, IF: 11.041), 2020.
    [PDF] [PDF] (Journal version) [Code]

    We generate potential hard negatives adversarial to the learned metric as complements.

    Structural Deep Metric Learning for Room Layout Estimation
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    European Conference on Computer Vision (ECCV), 2020.
    [PDF]

    We are the first to apply deep metric learning to prediction tasks with structured labels.

    Visual Representation Learning

    SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
    Han Xiao*, Wenzhao Zheng*, Sicheng Zuo, Peng Gao, Jie Zhou, Jiwen Lu
    European Conference on Computer Vision (ECCV), 2024.
    [PDF]

    We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies for vision transformers. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.

    OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions
    Chengkun Wang*, Wenzhao Zheng*, Zheng Zhu, Jie Zhou, Jiwen Lu
    IEEE International Conference on Computer Vision (ICCV), 2023.
    [arXiv] [Code]

    We unify fully supervised and self-supervised contrastive learning and exploit both supervisions from labeled and unlabeled data for training.

    Token-Label Alignment for Vision Transformers
    Han Xiao*, Wenzhao Zheng*, Zheng Zhu, Jie Zhou, Jiwen Lu
    IEEE International Conference on Computer Vision (ICCV), 2023.
    [arXiv] [Code]

    We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies for vision transformers. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.

    Explainable Artificial Intelligence

    Path Choice Matters for Clear Attribution in Path Methods
    Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
    International Conference on Learning Representations (ICLR), 2024.
    [arXiv] [Code]

    To address the ambiguity in attributions caused by different path choices, we introduced the Concentration Principle and developed SAMP, an efficient model-agnostic interpreter. By incorporating the infinitesimal constraint (IC) and momentum strategy (MS), SAMP provides superior interpretations.

    Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint
    Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
    International Conference on Learning Representations (ICLR), 2023.
    [arXiv] [Code]

    This paper proposes Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency.

    Attributable Visual Similarity Learning
    Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
    [arXiv] [Code]

    This paper proposes an attributable visual similarity learning (AVSL) framework, which employs a generalized similarity learning paradigm to represent the similarity between two images with a graph for a more accurate and explainable similarity measure between images.

    Honors and Awards

  • 2024 Excellent Doctoral Dissertation of Chinese Association for Artificial Intelligence
  • 2023 Tsinghua Excellent Doctoral Dissertation Award
  • 2023 Beijing Outstanding Graduate
  • 2023 Tsinghua Outstanding Graduate
  • 2022 Xuancheng Scholarship
  • 2021 National Scholarship (highest scholarship given by the government of China)
  • 2021 CVPR Outstanding Reviewer
  • 2020 Changtong Scholarship (highest scholarship in the Dept. of Automation)
  • 2019 National Scholarship (highest scholarship given by the government of China)
  • 2017 Tung OOCL Scholarship
  • 2016 German Scholarship

    Academic Services

  • Conference Reviewer / PC Member: CVPR 2019-2025, ICCV 2019-2023, ECCV 2020-2024, NeurIPS 2023-2024, ICLR 2024-2025, ICML 2025, IJCAI 2020-2024, WACV 2020-2024, ICME 2019-2024
  • Senior PC Member: IJCAI 2021
  • Journal Reviewer: T-PAMI, T-IP, T-MM, T-CSVT, T-NNLS, T-BIOM, T-IST, Pattern Recognition, Pattern Recognition Letters



    © Wenzhao Zheng | Last updated: Feb. 1, 2025.