:pencil: List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (COLM 2024)
Empowering open-source multimodal LLMs with Set-of-Mark prompting and improved visual reasoning ability.
[Paper] [HF Model]
:mega: Note: Our new dataset is complementary to existing training sources: add it to your training set to equip your multimodal LLM with Set-of-Mark prompting and improved general capabilities, at no extra cost at inference time!
:fire: News
- [04/26] Thanks to AK and HF daily papers for featuring our work!
- [04/25] Our paper is on arXiv! [Paper]
- [04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]
:scroll: Contents
- Results
- Dataset
- Model Weights
- Showcases
- Training
- Using SoM
:bar_chart: Results
| Method | LLM | POPE | MME | SEED-I | LLaVA-Wild | MM-VET |
| --- | --- | --- | --- | --- | --- | --- |
| BLIP-2 | Vicuna-13B | 85.3 | 1293.8 | 49.7 | 38.1 | 22.4 |
| LLaVA-1.5 | Vicuna-13B | 85.9 | 1531.3 | 68.2 | 70.7 | 35.4 |
| SoM-LLaVA-1.5 | Vicuna-13B | 86.6 | 1563.1 | 69.6 | 75.3 | 35.9 |
| SoM-LLaVA-1.5 w/ tags | Vicuna-13B | 87.0 | 1572.8 | 69.5 | 73.3 | 37.2 |
:mega: Note: We obtain 1% to 6% relative improvements across all benchmarks simply by adding 30k SoM data to the visual instruction tuning stage of LLaVA. SoM-LLaVA-1.5 w/ tags is evaluated with tagged images as input, but you can enjoy the performance gain even without the extra tags at test time!
:seedling: SoM Dataset
- `som_llava_mix695k.json`: full SFT data, LLaVA-665k + SoM-30k.
- `som_listing_coco10k.json`: listing all items with SoM images.
- `som_qa_coco20k.json`: QA with SoM images. (Note: the QA data reuses the same 10k images as listing, plus another batch of 10k.)
- `som_train2017.zip`: a subset of 20k COCO images annotated with SoM, used in our data construction.
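If you want a quick look at the released annotations before training, the sketch below loads the SFT mixture and prints one record. It assumes the records follow the standard LLaVA conversation schema (`id`, `image`, `conversations`); the field names here are assumptions, so verify them against the released file.

```python
import json

# Minimal sketch: inspect one record of the released SFT mixture.
# Assumes the LLaVA-style schema ("id", "image", "conversations");
# the field names are an assumption, verify against the released file.
with open("som_llava_mix695k.json") as f:
    data = json.load(f)

print(f"{len(data)} samples")
sample = data[0]
print(sample.get("image"))              # relative image path, e.g. under coco/som_train2017
for turn in sample.get("conversations", []):
    print(turn["from"], ":", turn["value"][:80])
```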
:cake: Model Checkpoints
We release our main model, SoM-LLaVA, trained with LLaVA-665k and our SoM-style listing + QA data, along with two additional models for ablation studies.
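If you plan to run the released checkpoint with the LLaVA codebase, loading it should look roughly like the sketch below; the model path is a placeholder, so take the exact identifier from the HF model card.

```python
# Sketch: load SoM-LLaVA with LLaVA-1.5's model builder.
# The model path below is a placeholder; use the exact id from the HF model card.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "path/or/hf-id/of/SoM-LLaVA-v1.5-13B"  # placeholder
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
```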
:dango: Showcases
:mushroom: Training
We adopt the training code of LLaVA. Please set up the environment following their instructions. Currently, our data is used in the Visual Instruction Tuning stage.
- Prepare data
Please download the annotation of our final instruction-tuning data mixture, `som_llava_mix695k.json`, and download the images from the constituent datasets:
- COCO: train2017
- COCO: som_train2017
- GQA: images
- OCR-VQA: download script (we save all files as `.jpg`)
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in your data folder (a quick sanity-check script is sketched right after this layout).
├── coco
│ ├── train2017
│ └── som_train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
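Before launching training, it is worth checking that every image folder is where the script expects it. A minimal sketch, assuming LLaVA's usual data root; adjust `DATA_ROOT` to your setup:

```python
import os

# Sketch: verify the expected image folders exist before training.
# DATA_ROOT is an assumption (LLaVA's default layout); adjust to your setup.
DATA_ROOT = "./playground/data"
expected = [
    "coco/train2017",
    "coco/som_train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    print(("ok      " if os.path.isdir(path) else "MISSING ") + path)
```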
- Training
After downloading our data (or preparing your own SoM data), train SoM-LLaVA via the command line (in LLaVA's finetuning script, this typically means pointing `--data_path` to `som_llava_mix695k.json` and `--image_folder` to your data folder):
bash scripts/v1_5/finetune.sh
:snowflake: Using SoM
Note: Our implementation improves over the original SoM repo by removing overlapping regions for each mask (otherwise there will be conflicts/overlaps in tag positions).
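The actual mask post-processing lives in the SoM folder of this repo; the snippet below is only a rough illustration of the idea, not the repo's exact logic: make the regions disjoint, e.g. by letting smaller masks keep contested pixels, so each numeric tag has an unambiguous spot.

```python
import numpy as np

def remove_overlaps(masks):
    """Rough illustration, not the repo's exact implementation:
    resolve overlaps by assigning each contested pixel to the smaller mask,
    so regions are disjoint and tag positions no longer collide."""
    order = sorted(range(len(masks)), key=lambda i: masks[i].sum())  # small -> large
    claimed = np.zeros(masks[0].shape, dtype=bool)
    result = [None] * len(masks)
    for i in order:
        kept = masks[i].astype(bool) & ~claimed  # drop pixels claimed by a smaller mask
        claimed |= kept
        result[i] = kept
    return result
```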
- Init virtual envs
# create env (note: use Python 3.10; 3.11 will cause package conflicts)
conda create -n som python=3.10 -y
conda activate som
- Install libgeos if you run into errors when installing SEEM
sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev
- Install segmentation packages
# download repo and navigate to the SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd SoM-LLaVA/SoM/
# install PyTorch
pip3 install torch torchvision torchaudio
# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..
# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
# install additional packages
pip install datasets
- Download the pretrained models
sh download_ckpt.sh
- Annotate COCO images with SoM
python annotate_coco.py
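The script produces tagged images together with per-tag labels; turning those into listing-style training text (the "list items one by one" paradigm) looks roughly like the sketch below. The prompt wording and record layout here are assumptions for illustration; see `som_listing_coco10k.json` for the released format.

```python
# Sketch: wrap per-tag labels from a SoM-annotated image into a
# listing-style training record. Field names and prompt wording are
# assumptions for illustration; see som_listing_coco10k.json for the
# actual released format.
def build_listing_sample(image_path, tag_labels):
    """tag_labels: {tag_id: short description}, e.g. {1: "a wooden bench"}."""
    answer = "\n".join(f"{tag}. {label}" for tag, label in sorted(tag_labels.items()))
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nList the tagged items in the image one by one."},
            {"from": "gpt", "value": answer},
        ],
    }

print(build_listing_sample("coco/som_train2017/000000000139.jpg",
                           {1: "a wooden chair", 2: "a potted plant"}))
```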
:cat: Citation
If you find our data or model useful for your research and applications, please cite our paper:
@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}
:beers: Acknowledgments
This project is a collaborative work between UC San Diego and Microsoft GenAI, built on top of LLaVA and SoM. We thank the authors for their contributions to the community!