MM-VUFM4DS
A systematic survey of multi-modal and multi-task visual understanding foundation models for driving scenarios
Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives
Abstract: Foundation models have made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities while simultaneously handling various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities across diverse learning paradigms. These paradigms include open-world understanding, efficient transfer for road scenes, continual learning, and interactive and generative capabilities. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
Authors: Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou, Xiubao Zhang, Haifeng Shen, Ruiqi Wu, Shuyi Geng, Yi Zhou, Ling Shao, Yi Yang, Bojun Gao, Qun Li and Guobin Wu
📖Table of Contents
- Overview
- News
- Roadmap
- Paper Collection
- Acknowledgement & Citation
😋Overview
Below is an overview of our survey, in which we delve into MM-VUFMs in terms of required prerequisites, current common practices, advanced foundation models across diverse learning paradigms, and key challenges and future trends.
We also systematically review current common practices for visual understanding of road scenes, covering task-specific models, unified multi-task models, unified multi-modal models, and prompting of foundation models.
Moreover, advanced capabilities across diverse learning paradigms are highlighted below, involving open-world understanding, efficient transfer for road scenes, continual learning, learning to interact, and generative foundation models.
💥News
- [2024.05.26] Our survey has been accepted by IEEE Transactions on Intelligent Vehicles (T-IV).
- [2024.02.05] Our survey is available here.
🗺️Roadmap
📚Paper Collection
- Related Surveys
- Task-specific Models
  - From instance-level perception to global-level understanding
  - From closed-set condition to open-set condition
  - From single modality to multi-modalities
- Unified Multi-task Models
  - Task-specific outputs
  - Unified language outputs
- Unified Multi-modal Models
  - LLM functions as sequence modeling
  - Cross-modal interaction in VLM
- Prompting Foundation Models
  - Textual prompt
  - Visual prompt
  - Multi-step prompt
  - Task-specific prompt
  - Prompt pool
- Related Datasets
- Towards Open-world Understanding
- Efficient Transfer for Road Scenes
- Continual Learning
- Learn to Interact
- Generative Foundation Models
- Closed-loop Driving Systems
- Interpretability
- Low-resource Condition
- Embodied Driving Agent
- World Model
💗Acknowledgement & Citation
This work was supported by the DiDi GAIA Research Cooperation Initiative. If you find this work useful, please consider citing:
@article{luo2024delving,
  title={Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives},
  author={Luo, Sheng and Chen, Wei and Tian, Wanxin and Liu, Rui and Hou, Luanxuan and Zhang, Xiubao and Shen, Haifeng and Wu, Ruiqi and Geng, Shuyi and Zhou, Yi and others},
  journal={arXiv preprint arXiv:2402.02968},
  year={2024}
}