VTC-CLS
Official repo for the paper "[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs"
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
This is the official implementation of VTC-CLS, an effective state-of-the-art method for training-free visual token compression in Multimodal Large Language Models (MLLMs).
VTC-CLS is simple and can serve as a plug-and-play method to accelerate MLLM inference in a training-free manner, making it highly practical.
News
- [x] [2024.12.08] Our paper has been submitted to arXiv.
- [x] [2024.12.10] We open-sourced our code!
Environmental Setup
conda create -n VTC-CLS python=3.10
conda activate VTC-CLS  # activate the environment before installing dependencies
pip install -r requirements.txt
- Download LLaVA-1.5-7B and put it at `../models/`.
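As a reference, one possible way to fetch the weights is with the Hugging Face Hub CLI; the `liuhaotian/llava-v1.5-7b` repo id and the local directory below are assumptions, so adjust them to whatever checkpoint layout your setup expects.

```bash
# Sketch: download LLaVA-1.5-7B into ../models/ via the Hugging Face Hub CLI.
# The repo id and target directory are illustrative assumptions, not fixed by this project.
pip install -U "huggingface_hub[cli]"
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ../models/llava-v1.5-7b
```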
Performance
We tested VTC-CLS on various models with different compression ratios, and display the LLaVA results here. Compared with existing methods such as FastV and LLaVA-PruMerge, our method achieves state-of-the-art performance among training-free approaches.

Efficiency
We measure the evaluation time and show that our method can effectively speed up the inference of MLLMs. We report the inference time of LLaVA-v1.5-7B on several test datasets before and after applying VTC-CLS.
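If you want a rough wall-clock comparison on your own hardware, a minimal sketch is to wrap an evaluation script with `time`; the dataset choice and the layer/token values below are arbitrary placeholders rather than settings prescribed by the paper.

```bash
# Rough timing sketch: measure end-to-end evaluation wall-clock time.
# Layer 2 and the token budgets (96 vs. 576) are placeholder values; try several budgets to see the trend.
time bash scripts/v1_5/eval/VTC-CLS/textvqa.sh 2 96
time bash scripts/v1_5/eval/VTC-CLS/textvqa.sh 2 576   # larger, near-full token budget for comparison
```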

Evaluation
You can simply run the scripts under `./scripts/v1_5/eval`. You should specify the start layer and the number of tokens to keep on the command line (except for the reproduce scripts), as in the example below.
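For instance, an evaluation run with VTC-CLS could be launched like this; the layer and token-budget values are illustrative placeholders, not recommended settings.

```bash
# Illustrative invocation of a benchmark script (here GQA, mirroring the sections below).
# layer = start layer, token_num = number of visual tokens to keep (see above);
# both values here are placeholders, so choose your own compression setting.
layer=2
token_num=96
bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num

# The reproduce scripts take no extra arguments:
bash scripts/v1_5/eval/reproduce/gqa.sh
```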
GQA
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` as this due to the missing assets in the GQA v1.2 release.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/gqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/gqa.sh
ScienceQA
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/sqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/sqa.sh
TextVQA
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `../data/textvqa`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/textvqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/textvqa.sh
POPE
- Download `coco` from POPE and put it under `../data`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/pope.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/pope.sh
MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
MMBench-CN
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench_cn.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench_cn.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
SEED-Bench
- Follow the official instructions to download the images and the videos. Put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only use the image subset for evaluation.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/seed.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/seed.sh
MM-Vet
- Extract `mm-vet.zip` to `../data/mmvet`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmvet.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmvet.sh
- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
Acknowledgement
Our codebase is partly built upon LLaVolta and LLaVA-PruMerge.
Thanks for their great implementations!
Citation
If our code or models help your work, please cite our paper:
@article{wang2024cls,
  title={[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs},
  author={Wang, Ao and Sun, Fengyuan and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang},
  journal={arXiv preprint arXiv:2412.05819},
  year={2024}
}