
:sunglasses: Awesome 3D and 4D World Models
This survey reviews state-of-the-art 3D and 4D world models: systems that learn, predict, and simulate the geometry and dynamics of real environments from multi-modal signals.
We unify terminology, scope, and evaluation protocols, and organize the field into three complementary paradigms by representation:
- **VideoGen** — learns generative or predictive models from sequential video streams under geometric and temporal constraints. VideoGen focuses on long-horizon consistency, controllability, and scene-level generation, enabling agents to imagine or forecast plausible video rollouts.
- **OccGen** — models 3D/4D occupancy grids that encode geometry and semantics in voxel space. OccGen provides a physics-consistent scaffold for robust perception, forecasting, and simulation, bridging low-level sensor data and high-level reasoning.
- **LiDARGen** — leverages point cloud sequences from LiDAR sensors to generate or predict geometry-grounded scenes. LiDARGen emphasizes high-fidelity 3D structure, robustness to environment changes, and applications in safety-critical domains such as autonomous driving.
For more details, please refer to our paper and project page. :rocket:
:books: Citation
If you find this work helpful for your research, please consider citing our paper:
@article{survey_3d_4d_world_models,
  title   = {3D and 4D World Modeling: A Survey},
  author  = {Lingdong Kong and Wesley Yang and Jianbiao Mei and Youquan Liu and Ao Liang and Dekai Zhu and Dongyue Lu and Wei Yin and Xiaotao Hu and Mingkai Jia and Junyuan Deng and Kaiwen Zhang and Yang Wu and Tianyi Yan and Shenyuan Gao and Song Wang and Linfeng Li and Liang Pan and Yong Liu and Jianke Zhu and Wei Tsang Ooi and Steven C. H. Hoi and Ziwei Liu},
  journal = {arXiv preprint arXiv:2509.07996},
  year    = {2025},
}
Table of Contents
- 0. Background
- 1. Benchmarks & Datasets
- Benchmarks
- Workshops
- Datasets
- 2. World Modeling from Video Generation
- Data Engines
- Action Interpreters
- Neural Simulators
- Scene Reconstructors
- 3. World Modeling from Occupancy Generation
- Scene Representors
- Occupancy Forecasters
- Autoregressive Simulators
- 4. World Modeling from LiDAR Generation
- Data Engines
- Action Forecasters
- Autoregressive Simulators
- 5. Applications
- Autonomous Driving
- Robotics
- Video Games & XR
- Digital Twins
- 6. Other Resources
- Tutorials
- Talks & Seminars
- 7. Acknowledgements
Background
World modeling has become a cornerstone of modern AI, enabling agents to understand, represent, and predict dynamic environments. While prior research has focused primarily on 2D images and videos, the rapid emergence of native 3D and 4D representations (e.g., RGB-D, occupancy grids, LiDAR point clouds) calls for a dedicated study.
What Are Native 3D Representations?
Unlike 2D projections, native 3D/4D signals directly encode metric geometry, visibility, and motion in the physical coordinates where agents act. Examples include:
- RGB-D imagery (2D images with depth channels)
- Occupancy grids (voxelized maps of free vs. occupied space)
- LiDAR point clouds (3D coordinates from active sensing)
- Neural fields (e.g., NeRF, Gaussian Splatting)
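To make the representations above concrete, the sketch below voxelizes a LiDAR-style point cloud into a binary occupancy grid with NumPy. The grid range, voxel size, and the `voxelize` helper are illustrative choices only, not part of any benchmark's official pipeline.

```python
import numpy as np

def voxelize(points, grid_range=(-40.0, 40.0), voxel_size=0.5):
    """Quantize an (N, 3) point cloud into a binary occupancy grid.

    Hypothetical setup: an 80 m cube around the sensor at 0.5 m
    resolution; real pipelines use per-axis ranges and finer voxels.
    """
    lo, hi = grid_range
    dim = int((hi - lo) / voxel_size)                  # cells per axis
    grid = np.zeros((dim, dim, dim), dtype=bool)
    idx = np.floor((points - lo) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dim), axis=1)  # drop out-of-range points
    grid[tuple(idx[inside].T)] = True                  # mark occupied voxels
    return grid

# Toy "scan": random points squashed toward the ground plane.
rng = np.random.default_rng(0)
scan = rng.uniform(-40, 40, size=(1000, 3)) * np.array([1, 1, 0.02])
grid = voxelize(scan)
print(grid.shape)  # (160, 160, 160)
```

Semantic occupancy datasets (e.g., SemanticKITTI, Occ3D-nuScenes) extend this idea by storing a class label per voxel instead of a single bit.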
What Are World Models in 3D and 4D?
A 3D/4D world model is an internal representation that allows an agent to imagine, forecast, and interact with its environment in 3D space.
- **Generative world models**: synthesize plausible 3D/4D worlds under given conditions (e.g., text prompts, trajectories).
- **Predictive world models**: anticipate the future evolution of 3D/4D scenes given past observations and actions.
Together, these models provide the foundation for simulation, planning, and embodied intelligence in complex environments.
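As a minimal illustration of the predictive family, the toy model below rolls out future latent states from a current state and a planned action sequence. The linear dynamics and all names (`LinearWorldModel`, `rollout`) are hypothetical stand-ins for a learned latent dynamics model, not any method from the survey.

```python
import numpy as np

class LinearWorldModel:
    """Toy predictive world model: next_state = A @ state + B @ action.

    A and B stand in for learned dynamics; in practice they would be
    fitted from sensor sequences rather than fixed by hand.
    """
    def __init__(self, state_dim=4, action_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.A = np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim))
        self.B = 0.1 * rng.standard_normal((state_dim, action_dim))

    def step(self, state, action):
        return self.A @ state + self.B @ action

    def rollout(self, state, actions):
        """Imagine a trajectory: apply a planned action sequence open-loop."""
        states = [state]
        for a in actions:
            states.append(self.step(states[-1], a))
        return np.stack(states)

model = LinearWorldModel()
plan = np.zeros((10, 2))                # e.g. "coast straight" for 10 steps
traj = model.rollout(np.ones(4), plan)
print(traj.shape)                       # (11, 4): initial state + 10 predictions
```

A planner can score such imagined rollouts and pick the action sequence whose predicted trajectory looks best; a generative model would instead sample whole scenes from conditions without requiring a past trajectory.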
1. Benchmarks & Datasets
Benchmarks
Workshops
| Theme | Venue | Date | Location | Recording |
|---|---|---|---|---|
| Workshop on World Modeling | - | February 4-6, 2026 | Montréal | - |
| Workshop on Embodied World Models for Decision Making | NeurIPS 2025 | December 6, 2025 | San Diego | - |
| Workshop on Reliable and Interactable World Models: Geometry, Physics, Interactivity and Real-World Generalization | ICCV 2025 | October 19, 2025 | Hawai'i | - |
| Workshop on Building Physically Plausible World Models | ICML 2025 | July 19, 2025 | Vancouver | - |
| Workshop on Assessing World Models | ICML 2025 | July 18, 2025 | Vancouver | - |
| Workshop on Benchmarking World Models | CVPR 2025 | June 12, 2025 | Nashville | - |
| Workshop on World Models: Understanding, Modelling and Scaling | ICLR 2025 | April 28, 2025 | Singapore | - |
| Workshop on Foundation Models for Autonomous Systems | CVPR 2024 | June 17, 2024 | Seattle | [YouTube] |
Datasets
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website |
|---|---|---|---|
| KITTI | Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite | CVPR 2012 |  |
| NYUv2 | Indoor Segmentation and Support Inference from RGBD Images | ECCV 2012 |  |
| CARLA | CARLA: An Open Urban Driving Simulator | CoRL 2017 |  |
| SemanticKITTI | SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences | ICCV 2019 |  |
| nuScenes | nuScenes: A Multimodal Dataset for Autonomous Driving | CVPR 2020 |  |
| Waymo Open | Scalability in Perception for Autonomous Driving: Waymo Open Dataset | CVPR 2020 |  |
| STF | Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather | CVPR 2020 |  |
| Virtual KITTI 2 | Virtual KITTI 2 | arXiv 2020 |  |
| Argoverse 2 | Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting | NeurIPS 2021 |  |
| Lyft-Level5 | One Thousand and One Hours: Self-Driving Motion Prediction Dataset | CoRL 2021 |  |
| nuPlan | nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles | CVPRW 2021 |  |
| PandaSet | PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving | ITSC 2022 |  |
| OpenCOOD | OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication | ICRA 2022 |  |
| KITTI-360 | KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D | TPAMI 2022 |  |
| CarlaSC | MotionSC: Data Set and Network for Real-Time Semantic Mapping in Dynamic Environments | RA-L 2022 |  |
| Robo3D | Robo3D: Towards Robust and Reliable 3D Perception against Corruptions | ICCV 2023 |  |
| OpenOccupancy | OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | ICCV 2023 |  |
| Occ3D-nuScenes | Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving | NeurIPS 2023 |  |
| OpenDV-YouTube | GenAD: Generalized Predictive Model for Autonomous Driving | CVPR 2024 |  |
| SSCBench | SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving | IROS 2024 |  |
| NAVSIM | NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | NeurIPS 2024 |  |
| DrivingDojo | DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model | NeurIPS 2024 |  |
| EUVS | Extrapolated Urban View Synthesis Benchmark | ICCV 2025 |  |
| Pi3DET | Perspective-Invariant 3D Object Detection | ICCV 2025 |  |
2. World Modeling from Video Generation
:one: Data Engines
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| BEVControl | BEVControl: Accurately Controlling Street-View Elements with Multi-Perspective Consistency via BEV Sketch Layout | arXiv 2023 | - | - |
| BEVGen | Street-View Image Generation from a Bird's-Eye View Layout | RA-L 2024 |  |  |
| MagicDrive | MagicDrive: Street View Generation with Diverse 3D Geometry Control | ICLR 2024 |  |  |
| Panacea | Panacea: Panoramic and Controllable Video Generation for Autonomous Driving | CVPR 2024 |  |  |
| DrivingDiffusion | DrivingDiffusion: Layout-Guided Multi-View Driving Scene Video Generation with Latent Diffusion Model | ECCV 2024 |  |  |
| WoVoGen | WoVoGen: World Volume-Aware Diffusion for Controllable Multi-Camera Driving Scene Generation | ECCV 2024 | - |  |
| Delphi | Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation | arXiv 2024 |  |  |
| SimGen | SimGen: Simulator-Conditioned Driving Scene Generation | NeurIPS 2024 |  |  |
| BEVWorld | BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents | arXiv 2024 | - | - |
| Panacea+ | Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving | arXiv 2024 |  | - |
| DiVE | DiVE: DiT-Based Video Generation with Enhanced Control | arXiv 2024 |  |  |
| SyntheOcc | SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs | arXiv 2024 |  |  |
| HoloDrive | HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving | arXiv 2024 | - | - |
| CogDriving | Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention | arXiv 2024 |  | - |
| UniMLVG | UniMLVG: Unified Framework for Multi-View Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving | arXiv 2024 | - |  |
| DrivePhysica | Physical Informed Driving World Model | arXiv 2024 |  | - |
| DriveDreamer-2 | DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation | AAAI 2025 |  |  |
| SubjectDrive | SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control | AAAI 2025 |  | - |
| Glad | Glad: A Streaming Scene Generator for Autonomous Driving | ICLR 2025 | - |  |
| DualDiff | DualDiff: Dual-Branch Diffusion Model for Autonomous Driving with Semantic Fusion | ICRA 2025 | - |  |
| UniScene | UniScene: Unified Occupancy-Centric Driving Scene Generation | CVPR 2025 |  |  |
| DriveScape | DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation | CVPR 2025 |  | - |
| PerLDiff | PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models | ICCV 2025 |  |  |
| MagicDrive-V2 | MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control | ICCV 2025 |  | - |
| Cosmos-Transfer1 | Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control | arXiv 2025 |  |  |
| DualDiff+ | DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance | arXiv 2025 | - |  |
| CoGen | CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving | arXiv 2025 |  | - |
| NoiseController | NoiseController: Towards Consistent Multi-View Video Generation via Noise Decomposition and Collaboration | arXiv 2025 | - | - |
| STAGE | STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation | arXiv 2025 | - | - |
:two: Action Interpreters
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| GAIA-1 | GAIA-1: A Generative World Model for Autonomous Driving | arXiv 2023 |  | - |
| ADriver-I | ADriver-I: A General World Model for Autonomous Driving | arXiv 2023 | - | - |
| Drive-WM | Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving | CVPR 2024 |  |  |
| DriveDreamer | DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving | ECCV 2024 |  |  |
| GenAD | GenAD: Generalized Predictive Model for Autonomous Driving | CVPR 2024 | - |  |
| Vista | Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability | NeurIPS 2024 |  |  |
| InfinityDrive | InfinityDrive: Breaking Time Limits in Driving World Models | arXiv 2024 |  | - |
| DrivingGPT | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-Modal Autoregressive Transformers | arXiv 2024 |  | - |
| DrivingWorld | DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT | arXiv 2024 |  |  |
| GEM | GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control | CVPR 2025 |  |  |
| MaskGWM | MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction | CVPR 2025 | - |  |
| Epona | Epona: Autoregressive Diffusion World Model for Autonomous Driving | ICCV 2025 |  |  |
| VaViM & VaVAM | VaViM and VaVAM: Autonomous Driving through Video Generative Modeling | arXiv 2025 |  |  |
| MiLA | MiLA: Multi-View Intensive-Fidelity Long-Term Video Generation World Model for Autonomous Driving | arXiv 2025 | - |  |
| GAIA-2 | GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving | arXiv 2025 |  | - |
| DriVerse | DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment | arXiv 2025 | - | - |
| PosePilot | PosePilot: Steering Camera Pose for Generative World Models with Self-Supervised Depth | arXiv 2025 | - | - |
| ProphetDWM | ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos | arXiv 2025 | - | - |
| LongDWM | LongDWM: Cross-Granularity Distillation for Building A Long-Term Driving World Model | arXiv 2025 |  |  |
:three: Neural Simulators
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| MagicDrive3D | MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes | arXiv 2024 |  |  |
| DreamForge | DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes | arXiv 2024 |  |  |
| Doe-1 | Doe-1: Closed-Loop Autonomous Driving with Large World Model | arXiv 2024 |  |  |
| DrivingSphere | DrivingSphere: Building A High-Fidelity 4D World for Closed-Loop Simulation | CVPR 2025 |  |  |
| UMGen | Generating Multimodal Driving Scenes via Next-Scene Prediction | CVPR 2025 |  |  |
| DriveArena | DriveArena: A Closed-Loop Generative Simulation Platform for Autonomous Driving | ICCV 2025 |  |  |
| InfiniCube | InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models | ICCV 2025 |  |  |
| DiST-4D | DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation | ICCV 2025 |  |  |
| UniFuture | Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception | arXiv 2025 |  |  |
| Nexus | Decoupled Diffusion Sparks Adaptive Scene Generation | arXiv 2025 |  |  |
| Challenger | Challenger: Affordable Adversarial Driving Video Generation | arXiv 2025 |  |  |
| Cosmos-Drive | Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models | arXiv 2025 |  |  |
:four: Scene Reconstructors
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| 3DGS | 3D Gaussian Splatting for Real-Time Radiance Field Rendering | TOG 2023 |  |  |
| StreetGaussian | Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting | ECCV 2024 |  |  |
| 4DGF | Dynamic 3D Gaussian Fields for Urban Areas | NeurIPS 2024 |  |  |
| SCube | SCube: Instant Large-Scale Scene Reconstruction using VoxSplats | NeurIPS 2024 |  |  |
| HUGS | HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting | CVPR 2024 |  |  |
| MagicDrive3D | MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes | arXiv 2024 |  |  |
| S3Gaussian | S3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving | arXiv 2024 |  |  |
| VDG | VDG: Vision-Only Dynamic Gaussian for Driving Simulation | arXiv 2024 |  |  |
| UniGaussian | UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations | arXiv 2024 | - | - |
| Stag-1 | Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model | arXiv 2024 |  |  |
| DrivingRecon | DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving | arXiv 2024 | - |  |
| OccScene | OccScene: Semantic Occupancy-Based Cross-Task Mutual Learning for 3D Scene Generation | arXiv 2024 | - | - |
| SGD | SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior | WACV 2025 | - | - |
| OmniRe | OmniRe: Omni Urban Scene Reconstruction | ICLR 2025 |  |  |
| DriveDreamer4D | DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation | CVPR 2025 |  |  |
| DeSiRe-GS | DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes | CVPR 2025 | - |  |
| SplatAD | SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving | CVPR 2025 |  |  |
| ReconDreamer | ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration | CVPR 2025 |  |  |
| FreeSim | FreeSim: Toward Free-Viewpoint Camera Simulation in Driving Scenes | CVPR 2025 |  | - |
| StreetCrafter | StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models | CVPR 2025 |  |  |
| FlexDrive | FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering | CVPR 2025 | - | - |
| S-NeRF++ | S-NeRF++: Autonomous Driving Simulation via Neural Reconstruction and Generation | TPAMI 2025 | - | - |
| InfiniCube | InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models | ICCV 2025 |  |  |
| DiST-4D | Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation | ICCV 2025 |  |  |
| DreamDrive | DreamDrive: Generative 4D Scene Modeling from Street View Images | arXiv 2025 |  | - |
| Uni-Gaussians | Uni-Gaussians: Unifying Camera and Lidar Simulation with Gaussians for Dynamic Driving Scenarios | arXiv 2025 |  | - |
| MuDG | MuDG: Taming Multi-Modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction | arXiv 2025 |  |  |
| UniFuture | Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception | arXiv 2025 |  |  |
| SceneCrafter | Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving Humanoid Robots | arXiv 2025 | - |  |
| ReconDreamer++ | ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation | arXiv 2025 |  |  |
| RealEngine | RealEngine: Simulating Autonomous Driving in Realistic Context | arXiv 2025 | - |  |
| GeoDrive | GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control | arXiv 2025 | - |  |
| PseudoSimulation | Pseudo-Simulation for Autonomous Driving | arXiv 2025 | - |  |
| Dreamland | Dreamland: Controllable World Creation with Simulator and Generative Models | arXiv 2025 |  | - |
3. World Modeling from Occupancy Generation
:one: Scene Representors
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| SSD | Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data | arXiv 2023 | - |  |
| SemCity | SemCity: Semantic Scene Generation with Triplane Diffusion | CVPR 2024 |  |  |
| WoVoGen | WoVoGen: World Volume-Aware Diffusion for Controllable Multi-Camera Driving Scene Generation | ECCV 2024 | - |  |
| UrbanDiff | Urban Scene Diffusion through Semantic Occupancy Map | arXiv 2024 |  | - |
| DrivingSphere | DrivingSphere: Building A High-Fidelity 4D World for Closed-Loop Simulation | CVPR 2025 |  |  |
| UniScene | UniScene: Unified Occupancy-Centric Driving Scene Generation | CVPR 2025 |  |  |
| OccScene | OccScene: Semantic Occupancy-Based Cross-Task Mutual Learning for 3D Scene Generation | arXiv 2024 | - | - |
| InfiniCube | InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models | ICCV 2025 |  |  |
| Control-3D-Scene | Controllable 3D Outdoor Scene Generation via Scene Graphs | ICCV 2025 |  |  |
| X-Scene | X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability | arXiv 2025 |  |  |
:two: Occupancy Forecasters
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| Emergent-Occ | Differentiable Raycasting for Self-supervised Occupancy Forecasting | ECCV 2022 | - |  |
| FF4D | Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting | CVPR 2023 |  |  |
| UniWorld | UniWorld: Autonomous Driving Pre-Training via World Models | arXiv 2023 | - | - |
| UniScene | UniScene: Multi-Camera Unified Pre-Training via 3D Scene Reconstruction for Autonomous Driving | arXiv 2023 | - |  |
| OccWorld | OccWorld: Learning A 3D Occupancy World Model for Autonomous Driving | ECCV 2024 |  |  |
| Cam4DOcc | Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications | CVPR 2024 | - |  |
| DriveWorld | DriveWorld: 4D Pre-Trained Scene Understanding via World Models for Autonomous Driving | CVPR 2024 | - | - |
| OccSora | OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving | arXiv 2024 |  |  |
| UnO | UnO: Unsupervised Occupancy Fields for Perception and Forecasting | CVPR 2024 |  | - |
| LOPR | Self-Supervised Multi-Future Occupancy Forecasting for Autonomous Driving | arXiv 2024 | - | - |
| FSF-Net | FSF-Net: Enhance 4D Occupancy Forecasting with Coarse BEV Scene Flow for Autonomous Driving | arXiv 2024 | - | - |
| OccLLaMA | OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving | arXiv 2024 | - | - |
| DOME | DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model | arXiv 2024 |  |  |
| GaussianAD | GaussianAD: Gaussian-Centric End-to-End Autonomous Driving | arXiv 2024 |  |  |
| DFIT-OccWorld | An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-Assisted Training | arXiv 2024 | - | - |
| Drive-OccWorld | Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving | AAAI 2025 |  |  |
| PreWorld | Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving | ICLR 2025 | - |  |
| OccProphet | OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework | ICLR 2025 | - |  |
| RenderWorld | RenderWorld: World Model with Self-Supervised 3D Label | ICRA 2025 | - | - |
| Occ-LLM | Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models | ICRA 2025 | - | - |
| EfficientOCF | Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting | CVPR 2025 | - | - |
| DIO | DIO: Decomposable Implicit 4D Occupancy-Flow World Model | CVPR 2025 | - | - |
| T³Former | Temporal Triplane Transformers as Occupancy World Models | arXiv 2025 | - | - |
| UniOcc | UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving | ICCV 2025 |  |  |
| I²World | I²-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting | ICCV 2025 | - |  |
| COME | COME: Adding Scene-Centric Forecasting Control to Occupancy World Model | arXiv 2025 | - |  |
:three: Autoregressive Simulators
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| SemCity | SemCity: Semantic Scene Generation with Triplane Diffusion | CVPR 2024 |  |  |
| XCube | XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies | CVPR 2024 |  |  |
| PDD | Pyramid Diffusion for Fine 3D Large Scene Generation | ECCV 2024 |  |  |
| OccSora | OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving | arXiv 2024 |  |  |
| DynamicCity | DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes | ICLR 2025 |  |  |
| DrivingSphere | DrivingSphere: Building A High-Fidelity 4D World for Closed-Loop Simulation | CVPR 2025 |  |  |
| InfiniCube | InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models | ICCV 2025 |  |  |
| X-Scene | X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability | arXiv 2025 |  |  |
| PrITTI | PrITTI: Primitive-Based Generation of Controllable and Editable 3D Semantic Scenes | arXiv 2025 |  |  |
4. World Modeling from LiDAR Generation
:one: Data Engines
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DUSty | Learning to Drop Points for LiDAR Scan Synthesis | IROS 2021 |  |  |
| LiDARGen | Learning to Generate Realistic LiDAR Point Clouds | ECCV 2022 | - |  |
| DUSty v2 | Generative Range Imaging for Learning Scene Priors of 3D LiDAR Data | WACV 2023 |  |  |
| UltraLiDAR | UltraLiDAR: Learning Compact Representations for LiDAR Completion and Generation | CVPR 2023 |  | - |
| Copilot4D | Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion | ICLR 2024 |  | - |
| R2DM | LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models | ICRA 2024 |  |  |
| ViDAR | Visual Point Cloud Forecasting enables Scalable Autonomous Driving | CVPR 2024 | - |  |
| LiDiff | Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion | CVPR 2024 | - |  |
| LiDM | Towards Realistic Scene Generation with LiDAR Diffusion Models | CVPR 2024 | - |  |
| RangeLDM | RangeLDM: Fast Realistic LiDAR Point Cloud Generation | ECCV 2024 | - |  |
| Text2LiDAR | Text2LiDAR: Text-Guided LiDAR Point Cloud Generation via Equirectangular Transformer | ECCV 2024 | - |  |
| LiDARGRIT | Taming Transformers for Realistic LiDAR Point Cloud Generation | arXiv 2024 | - |  |
| BEVWorld | BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents | arXiv 2024 | - |  |
| SDS | Simultaneous Diffusion Sampling for Conditional LiDAR Generation | arXiv 2024 | - | - |
| DiffSSC | DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models | IROS 2025 | - | - |
| HoloDrive | HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving | arXiv 2024 | - | - |
| LOGen | LOGen: Toward LiDAR Object Generation by Point Diffusion | arXiv 2024 |  |  |
| OLiDM | OLiDM: Object-Aware LiDAR Diffusion Models for Autonomous Driving | AAAI 2025 |  |  |
| X-Drive | X-Drive: Cross-Modality Consistent Multi-Sensor Data Synthesis for Driving Scenarios | ICLR 2025 | - |  |
| LidarDM | LidarDM: Generative LiDAR Simulation in a Generated World | ICRA 2025 |  |  |
| LiDAR-EDIT | LiDAR-EDIT: LiDAR Data Generation by Editing the Object Layouts in Real-World Scenes | ICRA 2025 |  |  |
| R2Flow | Fast LiDAR Data Generation with Rectified Flows | ICRA 2025 |  |  |
| WeatherGen | WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion | CVPR 2025 | - |  |
| LiDPM | LiDPM: Rethinking Point Diffusion for Lidar Scene Completion | IV 2025 |  |  |
| HERMES | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | ICCV 2025 |  |  |
| SuperPC | SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization | CVPR 2025 |  | - |
| SPIRAL | SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding | NeurIPS 2025 |  |  |
| 3DiSS | Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving | arXiv 2025 | - |  |
| Distill-DPO | Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion | arXiv 2025 | - |  |
| DriveX | DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving | arXiv 2025 | - | - |
| OpenDWM | OpenDWM: Open Driving World Models | arXiv 2025 | - |  |
| La La LiDAR | La La LiDAR: Large-Scale Layout Generation from LiDAR Data | arXiv 2025 | - | - |
| Veila | Veila: Panoramic LiDAR Generation from a Monocular RGB Image | arXiv 2025 | - | - |
| LiDARCrafter | LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences | arXiv 2025 |  |  |
:two: Action Forecasters
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| Copilot4D | Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion | ICLR 2024 |  | - |
| ViDAR | Visual Point Cloud Forecasting enables Scalable Autonomous Driving | CVPR 2024 | - |  |
| BEVWorld | BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents | arXiv 2024 | - |  |
| HERMES | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | ICCV 2025 |  |  |
| DriveX | DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving | arXiv 2025 | - | - |
:three: Autoregressive Simulators
:timer_clock: In chronological order, from the earliest to the latest.
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| HoloDrive | HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving | arXiv 2024 | - | - |
| LidarDM | LidarDM: Generative LiDAR Simulation in a Generated World | ICRA 2025 |  |  |
| OpenDWM | OpenDWM: Open Driving World Models | arXiv 2025 | - |  |
| LiDARCrafter | LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences | arXiv 2025 |  |  |
5. Applications
:one: Autonomous Driving
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| OccSora | OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving | arXiv 2024 | - |  |
| DFIT-OccWorld | An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-Assisted Training | arXiv 2024 | - | - |
| LiDARCrafter | LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences | arXiv 2025 |  |  |
| UniSim | UniSim: A Neural Closed-Loop Sensor Simulator | CVPR 2023 |  | - |
| Panacea | Panacea: Panoramic and Controllable Video Generation for Autonomous Driving | CVPR 2024 |  |  |
| Delphi | Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation | arXiv 2024 |  |  |
| DriveDreamer-2 | DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation | AAAI 2025 |  |  |
| Panacea+ | Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving | arXiv 2024 |  | - |
| MiLA | MiLA: Multi-View Intensive-Fidelity Long-Term Video Generation World Model for Autonomous Driving | arXiv 2025 | - |  |
| GAIA-2 | GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving | arXiv 2025 |  | - |
:two: Robotics
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| RoboDreamer | RoboDreamer: Learning Compositional World Models for Robot Imagination | arXiv 2024 |  |  |
| BEHAVIOR | BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities | CoRL 2025 |  |  |
| Habitat 2.0 | Habitat 2.0: Training Home Assistants to Rearrange their Habitat | arXiv 2021 | - | - |
| FMR | Foundation Models in Robotics: Applications, Challenges, and the Future | IJRR 2024 | - |  |
| VLMPS | Visual Language Maps for Robot Navigation | ICRA 2023 |  |  |
:three: Video Games & XR
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| ILVE | Interactive Latent Variable Evolution for the Generation of Minecraft Structures | ICFDG 2021 | - | - |
| ProcTHOR | ProcTHOR: Large-Scale Embodied AI Using Procedural Generation | NeurIPS 2022 |  |  |
| WorldGPT | WorldGPT: Empowering LLM as Multimodal World Model | ACM MM 2024 | - |  |
| WorldExplorer | WorldExplorer: Towards Generating Fully Navigable 3D Scenes | SIGGRAPH Asia 2025 |  |  |
| Text2World | Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | arXiv 2025 |  |  |
| FlexWorld | FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis | arXiv 2025 |  |  |
| Hunyuan-GameCraft | Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition | arXiv 2025 |  |  |
| HunyuanWorld 1.0 | HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels | arXiv 2025 |  |  |
| MGVQ | MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-Group Quantization | arXiv 2025 | - |  |
| EvoWorld | EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory | arXiv 2025 | - |  |
:four: Digital Twins
| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DynamicCity | DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes | ICLR 2025 |  |  |
| UrbanScene3D | Capturing, Reconstructing, and Simulating: the UrbanScene3D Dataset | ECCV 2022 |  |  |
| GaussianCity | GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation | CVPR 2025 |  |  |
| UrbanWorld | UrbanWorld: An Urban World Model for 3D City Generation | arXiv 2024 |  |  |
| SceneDiffuser++ | SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model | CVPR 2025 | - | - |
6. Other Resources
Tutorials
Talks & Seminars
7. Acknowledgements