DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Official PyTorch implementation of DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception.
- Authors: Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan.
- Institutes: Peking University; Beijing Academy of Artificial Intelligence; Dalian University of Technology
- Dataset: [🤗DenseFusion-4V-100K], [🤗DenseFusion-1M]
📜 News
[2024/07/12] The paper and dataset are released! 💥
💡 Introduction
- "An image is worth a thousand words". Comprehensive image descriptions are essential for multi-modal perception, while images contains various visual elements of different granularities that are challenging to harness.
- We propose Perceptual Fusion to integrate diverse visual perception experts for capturing visual elements, adopting an MLLM as the central pivot for comprehensive perception.
- We thereby provide the DenseFusion-1M dataset of highly informative image descriptions with various visual details, including rich OCR information, accurate object and position recognition, and external knowledge.
🛸 Method
- Pipeline of Perceptual Fusion to acquire the DenseFusion-1M dataset with hyper-detailed image descriptions. This pipeline leverages various visual experts as image priors and employs a multimodal model as the central pivot for integrating multi-source information. Its capability is learned from a 100K meta dataset generated by the advanced GPT-4V.
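Conceptually, the fusion step works as in the sketch below. This is a minimal illustration with hypothetical names (`perceptual_fusion`, `experts`, and `caption_mllm` are placeholders), not the released pipeline code.

```python
# Conceptual sketch of Perceptual Fusion (hypothetical names, not the
# released implementation).

def perceptual_fusion(image, caption_mllm, experts):
    """Fuse multi-source visual priors into one hyper-detailed caption."""
    # 1. Each visual expert extracts information at a different granularity,
    #    e.g. OCR text, detected objects with positions, attribute tags.
    priors = {name: expert(image) for name, expert in experts.items()}

    # 2. The caption engine (an MLLM whose fusion capability is learned from
    #    the 100K GPT-4V meta dataset) acts as the central pivot: it sees the
    #    image together with the textual priors and composes a description.
    hints = "\n".join(f"- {name}: {prior}" for name, prior in priors.items())
    prompt = f"Describe the image in detail, using these hints:\n{hints}"
    return caption_mllm.generate(image, prompt)
```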
📚 Dataset
- We carefully select 1M highly representative images from the uncurated LAION dataset through semantic clustering and de-duplication (see the sketch at the end of this section).
- Through perceptual fusion, we obtain the comprehensive image-text datasets DenseFusion-4V-100K and DenseFusion-1M.
- You can download the dataset from 🤗Huggingface, and the images can be obtained from their URLs using `./download/download.py` (a minimal download sketch follows the table below).
| Dataset | Captioned by | Link |
|---|---|---|
| DenseFusion-4V-100K | GPT-4V | 🤗Huggingface |
| DenseFusion-1M | Ours | 🤗Huggingface |
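For reference, downloading amounts to fetching each image URL listed in the annotation records. Below is a minimal, hypothetical sketch (the `url` and `image_id` field names are assumptions); `./download/download.py` remains the authoritative script.

```python
# Minimal, hypothetical sketch of fetching images from their URLs;
# refer to ./download/download.py for the actual download logic.
import json
import os
import requests

def download_images(annotation_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(annotation_file) as f:
        for line in f:
            # Assumes one JSON record per line with "url" and "image_id"
            # fields (hypothetical schema).
            record = json.loads(line)
            try:
                resp = requests.get(record["url"], timeout=10)
                resp.raise_for_status()
                path = os.path.join(out_dir, f"{record['image_id']}.jpg")
                with open(path, "wb") as img:
                    img.write(resp.content)
            except requests.RequestException as err:
                print(f"skip {record['url']}: {err}")
```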
- Visual examples from DenseFusion-1M, enriched with various detailed visual elements, such as OCR information, object/attribute information, spatial position, and external world knowledge.
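The image selection step mentioned above can be approximated as in the following sketch: cluster normalized image embeddings (e.g. CLIP features), then drop near-duplicates within each cluster. The use of k-means, the cluster count, and the similarity threshold are all illustrative assumptions; the paper's exact procedure may differ.

```python
# Hypothetical sketch of semantic clustering + de-duplication over image
# embeddings; the exact setup used for DenseFusion-1M may differ.
import numpy as np
from sklearn.cluster import KMeans

def select_representative(embeddings, n_clusters=1000, dup_threshold=0.95):
    """embeddings: (N, D) L2-normalized image features (e.g. from CLIP)."""
    # 1. Semantic clustering groups images by visual/semantic similarity.
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = []
        # 2. Within each cluster, de-duplicate: keep an image only if its
        #    cosine similarity to every already-kept image stays below the
        #    threshold (dot product == cosine for normalized features).
        for i in idx:
            if all(embeddings[i] @ embeddings[j] < dup_threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return keep
```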
🤖 Benchmark Performance
We utilize these highly informative DenseFusion-1M image captions for the pre-training stage. The training code largely follows LLaVA and ShareGPT4V.
The high-quality image-text data brings consistent and significant improvements, especially for high-resolution MLLMs that require detailed visual information for effective learning.
| Model | LLM | SQA-I | VQAv2 | GQA | VQA-T | MME | MMB | SEED-I | POPE | MMVet |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-7B | Vicuna-7B | 66.8 | 78.5 | 62.0 | 58.2 | 1510 | 64.3 | 66.2 | 85.9 | 30.5 |
| DenseFusion-7B | Vicuna-7B | 69.3 | 80.8 | 64.0 | 62.0 | 1574 | 69.2 | 70.1 | 86.5 | 37.8 |
| LLaVA-S2-7B | Vicuna-7B | 68.2 | 79.7 | 63.3 | 60.8 | 1520 | 66.4 | 67.2 | 86.7 | 34.6 |
| DenseFusion-S2-7B | Vicuna-7B | 72.1 | 81.6 | 65.3 | 67.4 | 1551 | 70.7 | 71.1 | 87.2 | 37.5 |
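Since the training code follows LLaVA, each DenseFusion caption plugs into the pre-training stage as a standard LLaVA-style image-caption record. The values below are illustrative placeholders:

```python
# Illustrative LLaVA-style pre-training record (placeholder values);
# field names follow LLaVA's annotation convention.
record = {
    "id": "000000001",
    "image": "images/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image in detail."},
        {"from": "gpt", "value": "A hyper-detailed DenseFusion-1M caption ..."},
    ],
}
```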
❤️ Acknowledgments
- LLaVA, ShareGPT4V: Thanks for their wonderful works and code!
- Vicuna: The amazing open-sourced large language model series!
- Scaling on Scales (S2): The wonderful project for an efficient and effective high-resolution MLLM architecture.
✒️ Citation
If DenseFusion is helpful for your research, please consider giving it a star ⭐ and citing it 📝:
```bibtex
@article{li2024DenseFusion,
  title={DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception},
  author={Xiaotong Li and Fan Zhang and Haiwen Diao and Yueze Wang and Xinlong Wang and Ling-Yu Duan},
  journal={arXiv preprint arXiv:2407.08303},
  year={2024}
}
```