

🚦DriveScenify: Boosting Driving Scene Understanding with Advanced Vision-Language Models

1SpacetimeLab, University College London, UK    2IIAU-Lab, Dalian University of Technology, China    3Key Lab of High Confidence Software Technologies, Peking University, China    4Data Science Institute, London School of Economics and Political Science, UK    5National Heart & Lung Institute, Imperial College London, UK

Demo πŸ“°

The demo link may expire from time to time, but don't worry: we will refresh it promptly, and a retrained version will be released soon. Have fun!😉

In addition, our model's understanding of general (non-driving) videos is limited, although it can handle some. If you want to try general videos, Video-LLaMA is a great project that does this better than we do.

Introduction πŸ“š

The increasing complexity of traffic situations, coupled with the rapid growth of urban populations, necessitates the development of innovative solutions that can mitigate congestion, reduce traffic-related accidents, and facilitate smoother transportation systems. Recognizing the significant impact of ChatGPT and computer vision technologies on various domains, it is timely to investigate how these advancements can be harnessed to address the critical challenges in urban transportation safety and efficiency.

With this motivation, we introduce DriveScenify, an approach that aims to boost driving scene understanding by leveraging advanced vision-language models. Our research focuses on developing a tailored version of MiniGPT-4, called DSify, which is specifically designed to process and generate contextually relevant responses based on driving scene videos. DriveScenify's integration of advanced vision-language models into the realm of transportation aims to unlock new possibilities for improving urban mobility, reducing traffic-related accidents, and enhancing the overall driving experience.

Furthermore, our combination of multiple visual encoders enables DSify to provide accurate and context-aware insights, which can be applied to a range of transportation applications, particularly traffic management and road safety analysis.
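To make the idea of combining encoders concrete, here is a minimal, purely illustrative sketch (not the actual DSify code) of how frame-level and video-level encoder features might be fused and projected into the LLM's token embedding space, in the spirit of MiniGPT-4's linear projection. All module names, dimensions, and the pooling/concatenation choices below are assumptions for illustration only.

```python
# Illustrative sketch only: NOT the actual DSify architecture.
import torch
import torch.nn as nn


class FusedVisionProjector(nn.Module):
    def __init__(self, frame_dim=1024, video_dim=768, llm_dim=5120):
        super().__init__()
        # Project both feature streams to the LLM token embedding size
        # (5120 would correspond to a Vicuna-13B-style backbone).
        self.frame_proj = nn.Linear(frame_dim, llm_dim)
        self.video_proj = nn.Linear(video_dim, llm_dim)

    def forward(self, frame_feats, video_feats):
        # frame_feats: (batch, num_frames, num_patches, frame_dim)
        # video_feats: (batch, num_tokens, video_dim)
        spatial = self.frame_proj(frame_feats.mean(dim=1))   # pool over time
        temporal = self.video_proj(video_feats)
        # Concatenate along the token axis so the LLM sees both streams.
        return torch.cat([spatial, temporal], dim=1)


if __name__ == "__main__":
    projector = FusedVisionProjector()
    frames = torch.randn(1, 8, 32, 1024)   # 8 sampled frames, 32 patch tokens
    video = torch.randn(1, 16, 768)        # 16 video-level tokens
    print(projector(frames, video).shape)  # torch.Size([1, 48, 5120])
```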

image

Some current shortcomings

  1. Why doesn't the model answer questions about the behavior in the video?!

This is quite likely to happen because we uniformly sample only 8 frames from each video, so the sampled frames may miss the time window in which the event occurs (see the sampling sketch after this list).

  2. The model is dreaming!

Yes, the current version of the model often outputs information that does not exist in the images/videos, but don't worry: we will keep optimizing it and do better. :)
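The sketch below (not the actual DSify sampling code) shows why uniform 8-frame sampling can skip a brief event entirely; the frame counts and event window are made-up numbers for illustration.

```python
# Minimal sketch: uniform sampling of 8 frames can miss a short event.

def uniform_sample_indices(total_frames: int, num_samples: int = 8):
    """Pick `num_samples` frame indices spread evenly across the video."""
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    total_frames = 300                 # e.g. a 10 s clip at 30 fps
    event_frames = range(110, 125)     # a ~0.5 s event (cut-in, sudden brake, ...)

    sampled = uniform_sample_indices(total_frames)
    print("sampled frames:", sampled)
    hit = any(idx in event_frames for idx in sampled)
    print("event captured:", hit)      # often False for short events
```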

Features 🌟

  • Spatial-temporal Safe Driving Scene Comprehension: DriveScenify is meticulously developed to accurately interpret diverse driving scenarios, encompassing traffic patterns, vehicle classifications, road conditions and temporal information, with a particular emphasis on promoting driving safety.
  • Contextual Response Formulation: The model is capable of generating context-sensitive responses and recommendations derived from the driving scene, offering valuable guidance to users.
  • General Video Understanding: While our central focus lies in training DSify on driving scenario videos, the model also exhibits a degree of competence in understanding and processing general video content. This versatility enhances its potential applications across a broader range of domains while maintaining its primary objective of improving driving safety and scene understanding.

Example πŸ’¬

demo

Usage πŸ’»

DriveScenify was initially designed to comprehend corner cases and potentially hazardous situations within driving scenes. Our aim was to leverage the capabilities of Large Language Models (LLMs) to enhance the reasoning process for video understanding, providing a more comprehensive analysis of complex and challenging driving scenarios.

If you want to try the demo in this repo, simply follow the installation process of MiniGPT-4 to prepare the environment and the Vicuna weights.

Then set the ckpt path in eval_configs/minigpt4_eval.yaml to our checkpoint (aligned with Vicuna-13B), which you can download here.

Launching Demo Locally

Try out our demo demo_video.py on your local machine by running

python demo_video.py --cfg-path eval_configs/minigpt4_eval.yaml

In fact, the demo supports both image and video inputs, so feel free to give it a try, even though the file is named "demo_video". Have fun exploring! πŸ˜„πŸŽ‰πŸ“·πŸŽ₯

Upcoming Tasks πŸ€–

  • [ ] Strong video foundation model.
  • [ ] Training with dialogue datasets.
  • [ ] Expanding data generation capabilities.
  • [ ] ...

Contributing 🀝

At present, DriveScenify is in its initial stages, and in many cases its performance may fall short of expectations. Data generation is still ongoing, and we are continuously working to improve the model. We highly appreciate and welcome contributions from the community to help enhance DriveScenify's capabilities and performance.

License πŸ“„

This repository is released under the BSD 3-Clause License. Much of the code is based on LAVIS, which is also under the BSD 3-Clause License here.

Acknowledgments 🀝

We would like to thank the developers of MiniGPT-4, LLaVA, InternVideo, Ask-Anything, Image2Paragraph and Vicuna for their incredible work and for providing the foundation for DriveScenify.

Citation πŸ“

If you find this repository useful in your research, please cite our repo:

@software{drivescenify2023multimodal,
  author = {Gao, Xiaowei and Li, Pengxiang and Jiang, Xinke and Haworth, James and Cardoso-Silva, Jonathan and Li, Ming},
  title = {DriveScenify: Boosting Driving Scene Understanding with Advanced Vision-Language Models},
  year = 2023,
  url = {https://github.com/pixeli99/DSify}
}