DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Official pytorch implementation of DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception.

📚 Paper 🤗 Dataset

📜 News

[2024/07/12] The paper and dataset are released! 💥

💡 Introduction

  • "An image is worth a thousand words". Comprehensive image descriptions are essential for multi-modal perception, while images contains various visual elements of different granularities that are challenging to harness.
  • We propose Perceptural Fusion to integrate the diverse visual perception experts for capturing visual elements and adopt a MLLM as a centric pivot for comprehensive perception.
  • We thereby provide DenseFusion-1M dataset for highly informative image descriptions with various visual details, including rich OCR information, accurate object and position recognition, and external knowledge, etc.

🛸 Method

  • Pipeline of Perceptual Fusion to acquire the DenseFusion dataset with hyper-detailed image descriptions. The pipeline leverages various visual experts as image priors and employs a multimodal model as the central pivot to integrate the multi-source information. Its capability is learned from a 100K meta dataset generated by the advanced GPT-4V.
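As a rough, hypothetical illustration of this idea (not the released pipeline code), the sketch below packs the outputs of several visual experts into a single fusion prompt for the central MLLM. The function name, expert fields, and prompt wording are all assumptions.

```python
# Hypothetical sketch of assembling multi-expert priors into one fusion prompt.
# Function name, field names, and prompt wording are illustrative assumptions,
# not the released DenseFusion pipeline code.
from typing import Dict, List


def build_fusion_prompt(
    short_caption: str,
    ocr_text: str,
    detections: List[Dict],  # e.g. [{"label": "dog", "box": [x1, y1, x2, y2]}, ...]
    tags: List[str],
) -> str:
    """Pack the outputs of visual experts into a prompt for the central MLLM."""
    det_lines = "\n".join(f"- {d['label']} at {d['box']}" for d in detections)
    return (
        "You are given an image and auxiliary visual information.\n"
        f"Brief caption: {short_caption}\n"
        f"OCR text found in the image: {ocr_text or 'none'}\n"
        f"Detected objects with bounding boxes:\n{det_lines or '- none'}\n"
        f"Image tags: {', '.join(tags)}\n"
        "Write a single, hyper-detailed description that faithfully integrates "
        "all of the information above without inventing content."
    )


if __name__ == "__main__":
    prompt = build_fusion_prompt(
        short_caption="A street scene with a food truck.",
        ocr_text="TACOS $3",
        detections=[{"label": "food truck", "box": [40, 60, 520, 380]}],
        tags=["street", "food truck", "signage"],
    )
    print(prompt)
```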

📚 Dataset

  • We carefully select 1M highly representative images from the uncurated LAION dataset through semantic clustering and de-duplication (a generic sketch of this idea appears at the end of this section).
  • Through perceptual fusion, we obtain the comprehensive image-text datasets DenseFusion-4V-100K and DenseFusion-1M.
  • You can download the dataset from 🤗 Hugging Face; the images can be obtained from their URLs with ./download/download.py (a minimal download sketch also follows at the end of this section).
| Dataset | Captioned by | Link |
| --- | --- | --- |
| DenseFusion-4V-100K | GPT-4V | 🤗Huggingface |
| DenseFusion-1M | Ours | 🤗Huggingface |
  • Visual examples from DenseFusion-1M, enriched with various detailed visual elements, such as OCR information, object/attribute information, spatial position, and external world knowledge.
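For intuition only, here is a generic sketch of semantic clustering plus near-duplicate removal over image embeddings. The models, thresholds, and function names are assumptions and do not reproduce the actual curation code.

```python
# Generic sketch of semantic clustering + near-duplicate removal over image
# embeddings. Thresholds, cluster count, and structure are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans


def dedup_within_cluster(embs: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal by cosine similarity; returns kept indices."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept


def select_representative(embeddings: np.ndarray, n_clusters: int = 1000) -> list[int]:
    """Cluster embeddings semantically, then de-duplicate inside each cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    selected: list[int] = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = dedup_within_cluster(embeddings[idx])
        selected.extend(int(idx[k]) for k in kept)
    return selected
```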
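./download/download.py is the authoritative downloader; the snippet below is only a minimal sketch of fetching images from URLs, assuming a JSONL metadata file with url and image_id fields (both field names are assumptions).

```python
# Minimal sketch of fetching images from URLs in the released metadata.
# The repo's ./download/download.py is the authoritative script; the JSONL
# field names ("url", "image_id") here are assumptions for illustration.
import json
import os

import requests


def download_images(jsonl_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            out_path = os.path.join(out_dir, f"{record['image_id']}.jpg")
            if os.path.exists(out_path):
                continue  # skip images that were already downloaded
            try:
                resp = requests.get(record["url"], timeout=10)
                resp.raise_for_status()
                with open(out_path, "wb") as img:
                    img.write(resp.content)
            except requests.RequestException:
                # URLs from web-scale corpora frequently go stale; skip failures.
                continue


if __name__ == "__main__":
    download_images("densefusion_1m.jsonl", "images/")
```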

🤖 Benchmark Performance

We utilize the highly informative image captions of DenseFusion-1M for the pre-training stage. The training code largely follows LLaVA and ShareGPT4V.
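Since the training code follows LLaVA and ShareGPT4V, the captions would typically be packed into LLaVA-style conversation records, as in the sketch below. The prompt text and file layout are assumptions based on the public LLaVA convention rather than this repo's exact scripts.

```python
# Sketch of converting a DenseFusion caption into a LLaVA-style pre-training
# record. The conversation schema follows the public LLaVA convention; field
# values and file layout are assumptions, not this repo's exact code.
import json


def to_llava_record(image_file: str, caption: str, sample_id: str) -> dict:
    return {
        "id": sample_id,
        "image": image_file,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this image in detail."},
            {"from": "gpt", "value": caption},
        ],
    }


if __name__ == "__main__":
    record = to_llava_record("000001.jpg", "A detailed description ...", "000001")
    with open("densefusion_pretrain.json", "w", encoding="utf-8") as f:
        json.dump([record], f, ensure_ascii=False, indent=2)
```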

The high-quality image-text data brings consistent and significant improvements, especially for high-resolution MLLMs that require detailed visual information for effective learning.

| Model | LLM | SQA^I | VQAv2 | GQA | VQA^T | MME | MMB | SEED^I | POPE | MMVet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-7B | Vicuna_7B | 66.8 | 78.5 | 62.0 | 58.2 | 1510 | 64.3 | 66.2 | 85.9 | 30.5 |
| DenseFusion-7B | Vicuna_7B | 69.3 | 80.8 | 64.0 | 62.0 | 1574 | 69.2 | 70.1 | 86.5 | 37.8 |
| LLaVA-S2-7B | Vicuna_7B | 68.2 | 79.7 | 63.3 | 60.8 | 1520 | 66.4 | 67.2 | 86.7 | 34.6 |
| DenseFusion-S2-7B | Vicuna_7B | 72.1 | 81.6 | 65.3 | 67.4 | 1551 | 70.7 | 71.1 | 87.2 | 37.5 |

❤️ Acknowledgments

  • LLaVA, ShareGPT4V: Thanks for their wonderful work and code!
  • Vicuna: The amazing open-source large language model series!
  • Scaling on Scales (S2): A wonderful project for an efficient and effective high-resolution MLLM architecture.

✒️ Citation

If DenseFusion is helpful for your research, please consider giving it a star ⭐ and a citation 📝:

@article{li2024DenseFusion,
      title={DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception},
      author={Xiaotong Li and Fan Zhang and Haiwen Diao and Yueze Wang and Xinlong Wang and Ling-Yu Duan},
      year={2024},
      journal={arXiv preprint arXiv:2407.08303},
}