DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Official PyTorch implementation of DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception.
- Authors: Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan.
- Institutes: Peking University; Beijing Academy of Artificial Intelligence; Dalian University of Technology
- Dataset: [🤗DenseFusion-4V-100K], [🤗DenseFusion-1M]
📜 News
[2024/07/12] The paper and dataset are released! 💥
💡 Introduction
- "An image is worth a thousand words". Comprehensive image descriptions are essential for multi-modal perception, while images contains various visual elements of different granularities that are challenging to harness.
- We propose Perceptual Fusion to integrate diverse visual perception experts for capturing visual elements, adopting an MLLM as the central pivot for comprehensive perception.
- We thereby provide the DenseFusion-1M dataset of highly informative image descriptions with various visual details, including rich OCR information, accurate object and position recognition, and external knowledge.
🛸 Method
- Pipeline of Perceptual Fusion to acquire the DenseFusion-1M dataset with hyper-detailed image descriptions. This pipeline leverages various visual experts as image priors and employs a multimodal model as the central pivot for integrating multi-source information. Its capability is learned from a 100K meta dataset generated by the advanced GPT-4V.
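Conceptually, the fusion step works as in the sketch below. This is a minimal illustration with hypothetical names (`perceptual_fusion`, `experts`, and `caption_mllm` are placeholders), not the released pipeline code.

```python
# Conceptual sketch of Perceptual Fusion (hypothetical names, not the
# released implementation).

def perceptual_fusion(image, caption_mllm, experts):
    """Fuse multi-source visual priors into one hyper-detailed caption."""
    # 1. Each visual expert extracts information at a different granularity,
    #    e.g. OCR text, detected objects with positions, attribute tags.
    priors = {name: expert(image) for name, expert in experts.items()}

    # 2. The caption engine (an MLLM whose fusion capability is learned from
    #    the 100K GPT-4V meta dataset) acts as the central pivot: it sees the
    #    image together with the textual priors and composes a description.
    hints = "\n".join(f"- {name}: {prior}" for name, prior in priors.items())
    prompt = f"Describe the image in detail, using these hints:\n{hints}"
    return caption_mllm.generate(image, prompt)
```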
📚 Dataset
- We carefully select 1M highly representative images from the uncurated LAION dataset through semantic clustering and de-duplication (see the sketch at the end of this section).
- Through perceptual fusion, we obtain the comprehensive image-text datasets DenseFusion-4V-100K and DenseFusion-1M.
- You can download the dataset from 🤗Huggingface, and the images can be obtained from their URLs using `./download/download.py` (a minimal download sketch follows the table below).
| Dataset | Captioned by | Link |
|---|---|---|
| DenseFusion-4V-100K | GPT-4V | 🤗Huggingface |
| DenseFusion-1M | Ours | 🤗Huggingface |
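For reference, downloading amounts to fetching each image URL listed in the annotation records. Below is a minimal, hypothetical sketch (the `url` and `image_id` field names are assumptions); `./download/download.py` remains the authoritative script.

```python
# Minimal, hypothetical sketch of fetching images from their URLs;
# refer to ./download/download.py for the actual download logic.
import json
import os
import requests

def download_images(annotation_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(annotation_file) as f:
        for line in f:
            # Assumes one JSON record per line with "url" and "image_id"
            # fields (hypothetical schema).
            record = json.loads(line)
            try:
                resp = requests.get(record["url"], timeout=10)
                resp.raise_for_status()
                path = os.path.join(out_dir, f"{record['image_id']}.jpg")
                with open(path, "wb") as img:
                    img.write(resp.content)
            except requests.RequestException as err:
                print(f"skip {record['url']}: {err}")
```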
- Visual examples from DenseFusion-1M, enriched with various detailed visual elements, such as OCR information, object/attribute information, spatial position, and external world knowledge.
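The image selection step mentioned above can be approximated as in the following sketch: cluster normalized image embeddings (e.g. CLIP features), then drop near-duplicates within each cluster. The use of k-means, the cluster count, and the similarity threshold are all illustrative assumptions; the paper's exact procedure may differ.

```python
# Hypothetical sketch of semantic clustering + de-duplication over image
# embeddings; the exact setup used for DenseFusion-1M may differ.
import numpy as np
from sklearn.cluster import KMeans

def select_representative(embeddings, n_clusters=1000, dup_threshold=0.95):
    """embeddings: (N, D) L2-normalized image features (e.g. from CLIP)."""
    # 1. Semantic clustering groups images by visual/semantic similarity.
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = []
        # 2. Within each cluster, de-duplicate: keep an image only if its
        #    cosine similarity to every already-kept image stays below the
        #    threshold (dot product == cosine for normalized features).
        for i in idx:
            if all(embeddings[i] @ embeddings[j] < dup_threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return keep
```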
🤖 Benchmark Performance
We utilize these highly informative DenseFusion-1M image captions for the pre-training stage. The training code largely follows LLaVA and ShareGPT4V.
The high-quality image-text data brings consistent and significant improvements, especially for high-resolution MLLMs that require detailed visual information for effective learning.
| Model | LLM | SQA-I | VQAv2 | GQA | VQA-T | MME | MMB | SEED-I | POPE | MMVet |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-7B | Vicuna-7B | 66.8 | 78.5 | 62.0 | 58.2 | 1510 | 64.3 | 66.2 | 85.9 | 30.5 |
| DenseFusion-7B | Vicuna-7B | 69.3 | 80.8 | 64.0 | 62.0 | 1574 | 69.2 | 70.1 | 86.5 | 37.8 |
| LLaVA-S2-7B | Vicuna-7B | 68.2 | 79.7 | 63.3 | 60.8 | 1520 | 66.4 | 67.2 | 86.7 | 34.6 |
| DenseFusion-S2-7B | Vicuna-7B | 72.1 | 81.6 | 65.3 | 67.4 | 1551 | 70.7 | 71.1 | 87.2 | 37.5 |
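Since the training code follows LLaVA, each DenseFusion caption plugs into the pre-training stage as a standard LLaVA-style image-caption record. The values below are illustrative placeholders:

```python
# Illustrative LLaVA-style pre-training record (placeholder values);
# field names follow LLaVA's annotation convention.
record = {
    "id": "000000001",
    "image": "images/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image in detail."},
        {"from": "gpt", "value": "A hyper-detailed DenseFusion-1M caption ..."},
    ],
}
```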
❤️ Acknowledgments
- LLaVA, ShareGPT4V: Thanks for their wonderful works and code!
- Vicuna: The amazing open-sourced large language model series!
- Scaling on Scales (S2): The wonderful project for an efficient and effective high-resolution MLLM architecture.
✒️ Citation
If DenseFusion is helpful for your research, please consider giving it a star ⭐ and citing it 📝:
```bibtex
@article{li2024DenseFusion,
  title={DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception},
  author={Xiaotong Li and Fan Zhang and Haiwen Diao and Yueze Wang and Xinlong Wang and Ling-Yu Duan},
  journal={arXiv preprint arXiv:2407.08303},
  year={2024}
}
```