[ICLR 2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want


🎨 Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao

Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang and Hongsheng Li

💥 News

  • [2025.01.23] 🎉 Draw-and-Understand has been accepted to ICLR 2025.

  • [2024.03.28] 🔥 We released the MDVP-Data dataset, MDVP-Bench benchmark, and the SPHINX-V-13B model.

  • [2024.03.28] 🚀 We released the arXiv paper.

  • [2024.03.28] 🚀 We released the training and evaluation code.

👀 Introduction

The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. Therefore, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.


Specifically, the model is named SPHINX-V: a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.


🚀 Examples

πŸ” Natural Image Domain


πŸ” OCR Image Domain


πŸ” Mobile/Website Screenshot Domain


πŸ” Multi-panel Image Domain


🛠️ Install

  1. Clone this repository and navigate to the Draw-and-Understand folder
git clone https://github.com/AFeng-x/Draw-and-Understand.git
cd Draw-and-Understand
  2. Install packages
# Create a new conda environment named 'sphinx-v' with Python 3.10
conda create -n sphinx-v python=3.10 -y
# Activate the 'sphinx-v' environment
conda activate sphinx-v
# Install required packages from 'requirements.txt'
pip install -r requirements.txt
  3. Optional: Install Flash-Attention
# Draw-and-Understand is powered by flash-attention for efficient attention computation.
pip install flash-attn --no-build-isolation
  4. Install Draw-and-Understand as a Python package
# Go to the root directory of Draw-and-Understand
cd Draw-and-Understand
# Install Draw-and-Understand in editable mode
pip install -e .
# After this, you can run "import SPHINX_V" from any working directory.
  5. To enable the segmentation ability shown in our official demo, SAM is also needed:
pip install git+https://github.com/facebookresearch/segment-anything.git
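
To confirm the environment is ready, a quick sanity check along these lines can verify that the key modules import. This is only a sketch: the module names are taken from the steps above, torch is assumed to come from requirements.txt, and flash_attn / segment_anything belong to the optional steps.
# Minimal sanity check for the environment set up above.
# flash_attn and segment_anything are optional and may legitimately be absent.
import importlib.util

for name in ("torch", "SPHINX_V", "flash_attn", "segment_anything"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name:<20} {'ok' if found else 'missing'}")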

🤖 Checkpoints

SPHINX-V-13B Stage-1 Pre-training Weights: 🤗 Hugging Face / Baidu

SPHINX-V-13B Stage-2 Fine-tuning Weights: 🤗 Hugging Face / Baidu

Other required weights and configurations: 🤗 Hugging Face

Please download them to your own machine. The file structure should appear as follows:

accessory/checkpoints/sphinx-v/stage2
├── consolidated.00-of-02.model.pth
├── consolidated.01-of-02.model.pth
├── tokenizer.model
├── config.json
└── meta.json
accessory/checkpoints/llama-2-13b
└── params.json

accessory/checkpoints/tokenizer
└── tokenizer.model
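
A short check like the following sketch (paths copied from the tree above, run from the repository root) can confirm the downloads landed in the expected places before training or inference:
# Verify the checkpoint layout shown above; run from the repository root.
import os

expected = [
    "accessory/checkpoints/sphinx-v/stage2/consolidated.00-of-02.model.pth",
    "accessory/checkpoints/sphinx-v/stage2/consolidated.01-of-02.model.pth",
    "accessory/checkpoints/sphinx-v/stage2/tokenizer.model",
    "accessory/checkpoints/sphinx-v/stage2/config.json",
    "accessory/checkpoints/sphinx-v/stage2/meta.json",
    "accessory/checkpoints/llama-2-13b/params.json",
    "accessory/checkpoints/tokenizer/tokenizer.model",
]
for path in expected:
    status = "ok" if os.path.isfile(path) else "missing"
    print(f"{status:<8} {path}")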

πŸ“ MDVP-Dataset

  • MDVP-Data is a comprehensive dataset for multi-domain visual-prompt instruction tuning. This dataset encompasses data for both point-level and region-level understanding, designed to enhance a model's comprehension ability and robustness.

  • Based on MDVP-Data, we also introduce MDVP-Bench, a challenging benchmark designed to evaluate tasks that require a combination of detailed description referrals, inter-relationship analysis, and complex reasoning.
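
Once the MDVP-Data annotations are downloaded (see the data preparation step in the Training section below), a quick inspection like this sketch can confirm a file parses correctly. The file name is a placeholder and the list-of-records layout is an assumption, so adjust it to the files you actually downloaded.
# Peek at one MDVP-Data annotation file (the path below is a placeholder).
# Assumes the file holds a JSON list of records; adjust if the layout differs.
import json

with open("path/to/mdvp_annotation.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"loaded {len(records)} records")
print("fields of the first record:", sorted(records[0].keys()))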


🚀 Training

  • Prepare data

    • Please download the annotations of our pre-training data and images. (Refer to the Dataset Preparation)
  • Stage 1: Image-Visual Prompt-Text Alignment Pre-training

    • Download the pretrained SPHINX-v2-1k weights from Hugging Face or Baidu (extraction code: 88z0). Place the model in the "accessory/checkpoints/sphinx-v2-1k" directory.
    • Download the ViT-H SAM model and place it in the "accessory/checkpoints/sam" directory.
    • The pre-training configuration is vp_pretrain.yaml. Please ensure that all annotations are included, and update the image paths in each JSON file to reflect the paths on your machine (a path-rewriting sketch follows this list).
    • Update the model paths in the run script.
    • Run bash scripts/train_sphinx-v_pretrain_stage1.sh.
  • Stage 2: Multi-Task End-to-End Supervised Finetuning

    • Download the SPHINX-V Stage-1 Pre-training Weights from 🤖 Checkpoints. Alternatively, you may use your own model weights trained in Stage 1.
    • Place the model in the "accessory/checkpoints/sphinx-v/stage1" directory.
    • The fine-tuning configuration is vp_finetune.yaml. As in Stage 1, ensure that all annotations are included and update the image paths in each JSON file to reflect the paths on your machine.
    • Update the model paths in the run script.
    • Run bash scripts/train_sphinx-v_finetune_stage2.sh.
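
Both stages ask you to point the annotation JSONs at the image locations on your machine. A hypothetical helper along the lines of the sketch below can rewrite a path prefix in bulk; the "image" field name and the list-of-records layout are assumptions, so adapt it to the actual annotation schema before use.
# Hypothetical helper: rewrite the image-path prefix inside an annotation JSON.
# Assumes each file is a JSON list of records with an "image" field; adjust to
# the real schema if it differs.
import json
import sys

def rewrite_prefix(json_path: str, old_prefix: str, new_prefix: str) -> None:
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        image_path = rec.get("image", "")
        if image_path.startswith(old_prefix):
            rec["image"] = new_prefix + image_path[len(old_prefix):]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    # Usage: python rewrite_prefix.py <annotation.json> <old_prefix> <new_prefix>
    rewrite_prefix(sys.argv[1], sys.argv[2], sys.argv[3])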

📈 Evaluation

See evaluation for details.

🛩️ Inference

We provide a simple inference example in inference.py.

You can launch this script with torchrun --master_port=1112 --nproc_per_node=1 inference.py

πŸͺ Host Local Demo

💻 Requirements:

  1. Prepare the SPHINX-V stage-2 checkpoints and the ViT-H SAM model, and place them in the accessory/checkpoints/ directory.
  2. Make sure you have installed Segment Anything.
  3. Run:
cd accessory/demos
bash run.sh

💌 Acknowledgement

  • LLaMA-Accessory: the codebase we built upon.
  • SAM: the demo also uses segmentation results from SAM.

🖊️ Citation

If you find our Draw-and-Understand project useful for your research and applications, please kindly cite using this BibTeX:

@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}