Wenzhao Zheng

I am currently a postdoctoral fellow in the Department of EECS at the University of California, Berkeley, affiliated with the Berkeley Artificial Intelligence Research Lab (BAIR) and Berkeley Deep Drive (BDD), supervised by Prof. Kurt Keutzer. Prior to that, I received my Ph.D. degree from the Department of Automation at Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu. In 2018, I received my BS degree from the Department of Physics, Tsinghua University.

I am generally interested in artificial intelligence and deep learning. My current research focuses on:

🦙 Foundation Models + 🚙 Physical Intelligence -> 🤖 AGI

  • 🦙 Foundation Models: Large Models, Visual Generation, AI Safety...
  • 🚙 Physical Intelligence: Spatial Understanding, Autonomous Driving, Embodied Robots...
  • If you want to work with me (in person or remotely) as an intern at BAIR, feel free to drop me an email at wzzheng@berkeley.edu.

    Email  /  CV  /  Google Scholar  /  GitHub

    News

  • 2026-01: 4 papers are accepted to ICLR 2026: StreamVGGT, SVG, Astra, and S2GO.
  • 2025-09: 3 papers are accepted to NeurIPS 2025: Point3R, QuadricFormer, and DrivingRecon.
  • 2025-06: 6 papers are accepted to ICCV 2025: EmbodiedOcc, Stag-1, SpectralAR, PlaneRAS, D3QE, and CDAL.
  • 2025-05: 1 paper is accepted to ICML 2025: SparseVLM.
  • 2025-02: 4 papers are accepted to CVPR 2025: SegAnyMo, GaussianWorld, GaussianFormer-2, and DeSiRe-GS.
  • 2025-01: 1 paper is accepted to ICLR 2025: UniDrive.
  • 2024-07: 4 papers are accepted to ECCV 2024: OccWorld, GenAD, GaussianFormer, and SpatialFormer.
  • 2024-02: 2 papers are accepted to CVPR 2024: SelfOcc and LowRankOcc.
  • 2024-01: 1 paper is accepted to ICLR 2024: SAMP.

  • *Equal contribution    †Project leader/Corresponding author.

    Selected Papers [Full List]

    Large Models: Native Multimodal Models, Efficient LLMs, Large Action Models...

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
    Yuan Zhang*, Chun-Kai Fan*, Junpeng Ma*, Wenzhao Zheng†, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
    International Conference on Machine Learning (ICML), 2025.
    [arXiv] [Code] [Project Page]

    SparseVLM sparsifies visual tokens adaptively based on the question prompt.

    Segment Any Motion in Videos
    Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
    [arXiv] [Code] [Project Page]

    Our model produces instance-level fine-grained moving object masks and can handle challenging scenarios including articulated structures, shadow reflections, dynamic background motion, and drastic camera movement.

    Visual Generation: Generative Frameworks, World Models, Digital Humans...

    SVG: Latent Diffusion Model without Variational Autoencoder
    Minglei Shi*, Haolin Wang*, Wenzhao Zheng†, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
    International Conference on Learning Representations (ICLR), 2026.
    [arXiv] [Code] [Model] [Project Page] [中文解读 (in Chinese)]

    SVG is a latent diffusion model without variational autoencoders to unleash Self-supervised representations for Visual Generation.

    Astra: General Interactive World Model with Autoregressive Denoising
    Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng†, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
    International Conference on Learning Representations (ICLR), 2026.
    [arXiv] [Code] [Model] [Project Page]

    Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.

    AI Safety: Visual Content Forensics, Transparent Models...

    GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
    Weiliang Chen, Wenzhao Zheng†, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan
    arXiv, 2025.
    [arXiv] [Code] [Project Page]

    GenWorld features three key characteristics: 1) Real-world Simulation, 2) High Quality, and 3) Cross-prompt Diversity, which can serve as a foundation for AI-generated video detection research with practical significance.

    Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
    Yifei Li, Wenzhao Zheng†, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
    arXiv, 2025.
    [arXiv] [Code] [Model] [Project Page]

    Skyra focuses on Grounded Artifact Reasoning to simultaneously perform Artifact Perception, Spatio-Temporal Grounding, and Explanatory Reasoning.

    Spatial Understanding: 4D Reconstruction, Native 4D Generation...

    Streaming 4D Visual Geometry Transformer
    Dong Zhuo*, Wenzhao Zheng*†, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
    International Conference on Learning Representations (ICLR), 2026.
    [arXiv] [Code] [Project Page]

    StreamVGGT employs temporal causal attention and leverages cached token memory to support efficient incremental on-the-fly reconstruction, enabling interactive and real-time online applications.

    Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
    Yuqi Wu*, Wenzhao Zheng*†, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2025.
    [arXiv] [Code] [Project Page]

    Point3R is an online framework for dense streaming 3D reconstruction using explicit spatial memory, which achieves competitive performance with low training costs.

    Autonomous Driving [Page]: 3D Occupancy Prediction, End-to-End Driving, 4D Driving Simulation...

    Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
    Yuanhui Huang*, Wenzhao Zheng*†, Yunpeng Zhang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
    [arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

    Given only surround-camera RGB images as inputs, our model (trained using only sparse LiDAR point supervision) can predict the semantic occupancy for all volumes in the 3D space.

    OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
    Wenzhao Zheng*†, Weiliang Chen*, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu
    European Conference on Computer Vision (ECCV), 2024.
    [arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

    OccWorld models the joint evolutions of 3D scenes and ego movements and paves the way for interpretable end-to-end large driving models.

    GenAD: Generative End-to-End Autonomous Driving
    Wenzhao Zheng*, Ruiqi Song*, Xianda Guo*, Chenming Zhang, Long Chen
    European Conference on Computer Vision (ECCV), 2024.
    [arXiv] [Code] [中文解读 (in Chinese)]

    GenAD casts end-to-end autonomous driving as a generative modeling problem.

    Doe-1: Closed-Loop Autonomous Driving with Large World Model
    Wenzhao Zheng*†, Zetian Xia*, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
    arXiv, 2024.
    [arXiv] [Code] [Project Page] [中文解读 (in Chinese)]

    Doe-1 is the first closed-loop autonomous driving model for unified perception, prediction, and planning.

    Embodied Robots: Vision-Language-Action Models, Embodied Simulation...

    EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
    Yuqi Wu*, Wenzhao Zheng*†, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
    IEEE International Conference on Computer Vision (ICCV), 2025.
    [arXiv] [Code] [Project Page]

    EmbodiedOcc formulates an embodied 3D occupancy prediction task and employs a Gaussian-based framework to accomplish it.

    SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
    Chaojun Ni*, Cheng Chen*, Xiaofeng Wang*, Zheng Zhu*, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
    arXiv, 2025.
    [arXiv] [Code] [Project Page]

    SwiftVLA integrates 4D spatiotemporal information into a lightweight vision-language-action model at minimal cost.

    My Ph.D. Research Topic

    Deep Metric Learning

    Introspective Deep Metric Learning
    Chengkun Wang*, Wenzhao Zheng*†, Zheng Zhu, Jie Zhou, Jiwen Lu
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2024.
    [arXiv] [Code]

    We propose an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images.

    Deep Metric Learning with Adaptively Composite Dynamic Constraints
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2023.
    [PDF]

    This paper formulates deep metric learning under a unified framework and proposes a dynamic constraint generator that produces adaptive composite constraints to train the metric towards good generalization.

    Hardness-Aware Deep Metric Learning
    Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019 (oral).
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF: 24.31), 2021.
    [PDF] [PDF] (Journal version) [Code]

    We perform linear interpolation on embeddings to adaptively manipulate their hardness levels and generate corresponding label-preserving synthetics for recycled training.

    Deep Adversarial Metric Learning
    Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018 (spotlight).
    Yueqi Duan, Jiwen Lu, Wenzhao Zheng, Jie Zhou
    IEEE Transactions on Image Processing (T-IP, IF: 11.041), 2020.
    [PDF] [PDF] (Journal version) [Code]

    We generate potential hard negatives adversarial to the learned metric as complements.

    Structural Deep Metric Learning for Room Layout Estimation
    Wenzhao Zheng, Jiwen Lu, Jie Zhou
    European Conference on Computer Vision (ECCV), 2020.
    [PDF]

    We are the first to apply deep metric learning to prediction tasks with structured labels.

    Honors and Awards

  • 2024 Excellent Doctoral Dissertation of Chinese Association for Artificial Intelligence
  • 2023 Tsinghua Excellent Doctoral Dissertation Award
  • 2023 Beijing Outstanding Graduate
  • 2023 Tsinghua Outstanding Graduate
  • 2022 Xuancheng Scholarship
  • 2021 National Scholarship (highest scholarship given by the government of China)
  • 2021 CVPR Outstanding Reviewer
  • 2020 Changtong Scholarship (highest scholarship in the Dept. of Automation)
  • 2019 National Scholarship (highest scholarship given by the government of China)
  • 2017 Tung OOCL Scholarship
  • 2016 German Scholarship

    Academic Services

  • Conference Reviewer / PC Member: CVPR 2019-2026, ICCV 2019-2025, ECCV 2020-2026, NeurIPS 2023-2025, ICLR 2024-2025, ICML 2025, IJCAI 2020-2025, WACV 2020-2025, ICME 2019-2025
  • Senior PC Member: IJCAI 2021
  • Journal Reviewer: T-PAMI, T-IP, T-MM, T-CSVT, T-NNLS, T-BIOM, T-IST, Pattern Recognition, Pattern Recognition Letters



    © Wenzhao Zheng | Last updated: Feb. 1, 2026.