VTC-CLS
Official repo for the paper "[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs"
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
This is the official implementation of VTC-CLS, an effective state-of-the-art method for training-free visual token compression in Multimodal Large Language Models (MLLMs).
VTC-CLS is simple and can serve as a plug-and-play method to accelerate MLLM inference in a training-free manner, making it highly practical.
News
- [x] [2024.12.08] Our paper has been submitted to arXiv.
- [x] [2024.12.10] We open-sourced our code!
Environmental Setup
conda create -n VTC-CLS python=3.10
conda activate VTC-CLS  # activate the environment before installing dependencies
pip install -r requirements.txt
- Download LLaVA-1.5-7B and put it at `../models/`.
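As a reference, one possible way to fetch the weights is with the Hugging Face Hub CLI; the `liuhaotian/llava-v1.5-7b` repo id and the local directory below are assumptions, so adjust them to whatever checkpoint layout your setup expects.

```bash
# Sketch: download LLaVA-1.5-7B into ../models/ via the Hugging Face Hub CLI.
# The repo id and target directory are illustrative assumptions, not fixed by this project.
pip install -U "huggingface_hub[cli]"
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ../models/llava-v1.5-7b
```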
Performance
We tested VTC-CLS on various models with different compression ratios, and display the LLaVA results here. Compared with existing methods such as FastV and LLaVA-PruMerge, our method achieves state-of-the-art performance among training-free approaches.

Efficiency
We measure the evaluation time and show that our method can effectively speed up the inference of MLLMs. We report the inference time of LLaVA-v1.5-7B on several test datasets before and after applying VTC-CLS.
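If you want a rough wall-clock comparison on your own hardware, a minimal sketch is to wrap an evaluation script with `time`; the dataset choice and the layer/token values below are arbitrary placeholders rather than settings prescribed by the paper.

```bash
# Rough timing sketch: measure end-to-end evaluation wall-clock time.
# Layer 2 and the token budgets (96 vs. 576) are placeholder values; try several budgets to see the trend.
time bash scripts/v1_5/eval/VTC-CLS/textvqa.sh 2 96
time bash scripts/v1_5/eval/VTC-CLS/textvqa.sh 2 576   # larger, near-full token budget for comparison
```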

Evaluation
You can simply run the scripts under `./scripts/v1_5/eval`. You should specify the start layer and the number of tokens to keep on the command line (except for the reproduce scripts), as in the example below.
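For instance, an evaluation run with VTC-CLS could be launched like this; the layer and token-budget values are illustrative placeholders, not recommended settings.

```bash
# Illustrative invocation of a benchmark script (here GQA, mirroring the sections below).
# layer = start layer, token_num = number of visual tokens to keep (see above);
# both values here are placeholders, so choose your own compression setting.
layer=2
token_num=96
bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num

# The reproduce scripts take no extra arguments:
bash scripts/v1_5/eval/reproduce/gqa.sh
```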
GQA
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` as this due to the missing assets in the GQA v1.2 release.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/gqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/gqa.sh
ScienceQA
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/sqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/sqa.sh
TextVQA
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `../data/textvqa`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/textvqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/textvqa.sh
POPE
- Download `coco` from POPE and put it under `../data`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/pope.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/pope.sh
MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
MMBench-CN
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench_cn.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench_cn.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
SEED-Bench
- Follow the official instructions to download the images and the videos. Put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only use the image subset for evaluation.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/seed.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/seed.sh
MM-Vet
- Extract `mm-vet.zip` to `../data/mmvet`.
- Single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmvet.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmvet.sh
- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
Acknowledgement
Our codebase is partly built upon LLaVolta and LLaVA-PruMerge.
Thanks for their great implementations!
Citation
If our code or models help your work, please cite our paper:
@article{wang2024cls,
  title={[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs},
  author={Wang, Ao and Sun, Fengyuan and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang},
  journal={arXiv preprint arXiv:2412.05819},
  year={2024}
}