:pencil: List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (COLM 2024)
Empowering open-source multimodal LLMs with Set-of-Mark prompting and improved visual reasoning ability.
[Paper] [HF Model]
:mega: Note: Our new dataset is complementary to existing training sources: add it to your training set to equip your multimodal LLM with Set-of-Mark prompting and improved general capabilities, at no extra cost at inference time!
:fire: News
- [04/26] Thanks to AK and HF daily papers for featuring our work!
- [04/25] Our paper is on arXiv! [Paper]
- [04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]
:scroll: Contents
- Results
- Dataset
- Model Weights
- Showcases
- Training
- Using SoM
:bar_chart: Results
| Method | LLM | POPE | MME | SEED-I | LLaVA-Wild | MM-VET |
| --- | --- | --- | --- | --- | --- | --- |
| BLIP-2 | Vicuna-13B | 85.3 | 1293.8 | 49.7 | 38.1 | 22.4 |
| LLaVA-1.5 | Vicuna-13B | 85.9 | 1531.3 | 68.2 | 70.7 | 35.4 |
| SoM-LLaVA-1.5 | Vicuna-13B | 86.6 | 1563.1 | 69.6 | 75.3 | 35.9 |
| SoM-LLaVA-1.5 w/ tags | Vicuna-13B | 87.0 | 1572.8 | 69.5 | 73.3 | 37.2 |
:mega: Note: We obtain 1% to 6% relative improvements across all benchmarks simply by adding 30k SoM data to the visual instruction tuning stage of LLaVA. SoM-LLaVA-1.5 w/ tags is evaluated with tagged images as input, but you can enjoy the performance gain even without the extra tags at test time!
:seedling: SoM Dataset
- `som_llava_mix695k.json`: full SFT data, LLaVA-665k + SoM-30k.
- `som_listing_coco10k.json`: listing all items with SoM images.
- `som_qa_coco20k.json`: QA with SoM images. (Note: the QA data reuses the same 10k images as listing, plus another batch of 10k.)
- `som_train2017.zip`: a subset of 20k COCO images annotated with SoM, used in our data construction.
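If you want a quick look at the released annotations before training, the sketch below loads the SFT mixture and prints one record. It assumes the records follow the standard LLaVA conversation schema (`id`, `image`, `conversations`); the field names here are assumptions, so verify them against the released file.

```python
import json

# Minimal sketch: inspect one record of the released SFT mixture.
# Assumes the LLaVA-style schema ("id", "image", "conversations");
# the field names are an assumption, verify against the released file.
with open("som_llava_mix695k.json") as f:
    data = json.load(f)

print(f"{len(data)} samples")
sample = data[0]
print(sample.get("image"))              # relative image path, e.g. under coco/som_train2017
for turn in sample.get("conversations", []):
    print(turn["from"], ":", turn["value"][:80])
```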
:cake: Model Checkpoints
We release our main model, SoM-LLaVA, trained with LLaVA-665k and our SoM-style listing + QA data, along with two additional models for ablation studies.
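If you plan to run the released checkpoint with the LLaVA codebase, loading it should look roughly like the sketch below; the model path is a placeholder, so take the exact identifier from the HF model card.

```python
# Sketch: load SoM-LLaVA with LLaVA-1.5's model builder.
# The model path below is a placeholder; use the exact id from the HF model card.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "path/or/hf-id/of/SoM-LLaVA-v1.5-13B"  # placeholder
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
```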
:dango: Showcases
:mushroom: Training
We adopt the training code of LLaVA. Please set up the environment following their instructions. Currently, our data is used in the Visual Instruction Tuning stage.
- Prepare data
Please download the annotation of our final instruction-tuning data mixture, `som_llava_mix695k.json`, and download the images from the constituent datasets:
- COCO: train2017
- COCO: som_train2017
- GQA: images
- OCR-VQA: download script (we save all files as `.jpg`)
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in your data folder (a quick sanity-check script is sketched right after this layout).
├── coco
│ ├── train2017
│ └── som_train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
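Before launching training, it is worth checking that every image folder is where the script expects it. A minimal sketch, assuming LLaVA's usual data root; adjust `DATA_ROOT` to your setup:

```python
import os

# Sketch: verify the expected image folders exist before training.
# DATA_ROOT is an assumption (LLaVA's default layout); adjust to your setup.
DATA_ROOT = "./playground/data"
expected = [
    "coco/train2017",
    "coco/som_train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    print(("ok      " if os.path.isdir(path) else "MISSING ") + path)
```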
- Training
After downloading our data (or preparing your own SoM data), train SoM-LLaVA via the command line (in LLaVA's finetuning script, this typically means pointing `--data_path` to `som_llava_mix695k.json` and `--image_folder` to your data folder):
bash scripts/v1_5/finetune.sh
:snowflake: Using SoM
Note: Our implementation improves over the original SoM repo by removing overlapping regions for each mask (otherwise there will be conflicts/overlaps in tag positions).
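The actual mask post-processing lives in the SoM folder of this repo; the snippet below is only a rough illustration of the idea, not the repo's exact logic: make the regions disjoint, e.g. by letting smaller masks keep contested pixels, so each numeric tag has an unambiguous spot.

```python
import numpy as np

def remove_overlaps(masks):
    """Rough illustration, not the repo's exact implementation:
    resolve overlaps by assigning each contested pixel to the smaller mask,
    so regions are disjoint and tag positions no longer collide."""
    order = sorted(range(len(masks)), key=lambda i: masks[i].sum())  # small -> large
    claimed = np.zeros(masks[0].shape, dtype=bool)
    result = [None] * len(masks)
    for i in order:
        kept = masks[i].astype(bool) & ~claimed  # drop pixels claimed by a smaller mask
        claimed |= kept
        result[i] = kept
    return result
```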
- Init virtual envs
# create env (note: use Python 3.10; 3.11 will cause package conflicts)
conda create -n som python=3.10 -y
conda activate som
- Install libgeos if you run into errors when installing SEEM
sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev
- Install segmentation packages
# download repo and navigate to the SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd SoM-LLaVA/SoM/
# install PyTorch
pip3 install torch torchvision torchaudio
# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..
# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
# install additional packages
pip install datasets
- Download the pretrained models
sh download_ckpt.sh
- Annotate COCO images with SoM
python annotate_coco.py
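The script produces tagged images together with per-tag labels; turning those into listing-style training text (the "list items one by one" paradigm) looks roughly like the sketch below. The prompt wording and record layout here are assumptions for illustration; see `som_listing_coco10k.json` for the released format.

```python
# Sketch: wrap per-tag labels from a SoM-annotated image into a
# listing-style training record. Field names and prompt wording are
# assumptions for illustration; see som_listing_coco10k.json for the
# actual released format.
def build_listing_sample(image_path, tag_labels):
    """tag_labels: {tag_id: short description}, e.g. {1: "a wooden bench"}."""
    answer = "\n".join(f"{tag}. {label}" for tag, label in sorted(tag_labels.items()))
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nList the tagged items in the image one by one."},
            {"from": "gpt", "value": answer},
        ],
    }

print(build_listing_sample("coco/som_train2017/000000000139.jpg",
                           {1: "a wooden chair", 2: "a potted plant"}))
```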
:cat: Citation
If you find our data or model useful for your research and applications, please cite our paper:
@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}
:beers: Acknowledgments
This project is a collaborative work between UC San Diego and Microsoft GenAI, built on top of LLaVA and SoM. We thank the authors for their contributions to the community!