MM-VUFM4DS
A systematic survey of multi-modal and multi-task visual understanding foundation models for driving scenarios
Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives
Abstract: Foundation models have made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities while simultaneously handling various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities across diverse learning paradigms. These paradigms include open-world understanding, efficient transfer for road scenes, continual learning, and interactive and generative capabilities. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
Authors: Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou, Xiubao Zhang, Haifeng Shen, Ruiqi Wu, Shuyi Geng, Yi Zhou, Ling Shao, Yi Yang, Bojun Gao, Qun Li and Guobin Wu
📖Table of Contents
- Overview
- News
- Roadmap
- Paper Collection
- Acknowledgement & Citation
😋Overview
Below is an overview of our survey, in which we delve into MM-VUFMs in terms of required prerequisites, current common practices, advanced foundation models across diverse learning paradigms, and key challenges and future trends.
We also systematically review current common practices for visual understanding of road scenes, covering task-specific models, unified multi-task models, unified multi-modal models, and prompting of foundation models.
Moreover, advanced capabilities across diverse learning paradigms are highlighted below, involving open-world understanding, efficient transfer for road scenes, continual learning, learning to interact, and generative foundation models.
💥News
- [2024.05.26] Our survey has been accepted by IEEE Transactions on Intelligent Vehicles (T-IV).
- [2024.02.05] Our survey is available here.
🗺️Roadmap
📚Paper Collection
- Related Surveys
- Task-specific Models
  - From instance-level perception to global-level understanding
  - From closed-set condition to open-set condition
  - From single modality to multi-modalities
- Unified Multi-task Models
  - Task-specific outputs
  - Unified language outputs
- Unified Multi-modal Models
  - LLM functions as sequence modeling
  - Cross-modal interaction in VLM
- Prompting Foundation Models
  - Textual prompt
  - Visual prompt
  - Multi-step prompt
  - Task-specific prompt
  - Prompt pool
- Related Datasets
- Towards Open-world Understanding
- Efficient Transfer for Road Scenes
- Continual Learning
- Learn to Interact
- Generative Foundation Models
- Closed-loop Driving Systems
- Interpretability
- Low-resource Condition
- Embodied Driving Agent
- World Model
💗Acknowledgement & Citation
This work was supported by the DiDi GAIA Research Cooperation Initiative. If you find this work useful, please consider citing:
@article{luo2024delving,
  title={Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives},
  author={Luo, Sheng and Chen, Wei and Tian, Wanxin and Liu, Rui and Hou, Luanxuan and Zhang, Xiubao and Shen, Haifeng and Wu, Ruiqi and Geng, Shuyi and Zhou, Yi and others},
  journal={arXiv preprint arXiv:2402.02968},
  year={2024}
}