
A New Powerful Visual Tower for LLaVA!

anxiangsir opened this issue 9 months ago · 3 comments

We adopted the official LLaVA-NeXT codebase and the official LLaVA-NeXT-Data training dataset to evaluate foundational vision models.
The language model is Qwen2.5-7B.

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|:---:|---:|---:|---:|---:|---:|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

The results of the ImageNet linear probe are as follows:

| Model Name | ImageNet Linear Probe | Hugging Face |
|---|---:|---|
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |
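
For context, an ImageNet linear probe freezes the vision encoder and trains only a single linear classifier on its pooled features; the table reports the top-1 accuracy of that classifier. Below is a minimal sketch of the idea, not the actual evaluation code; the feature dimension, the dummy encoder, and the single training step are placeholders.

```python
# Minimal linear-probe sketch (illustrative only): freeze the vision encoder,
# train a single linear layer on its pooled features.
import torch
import torch.nn as nn

feature_dim, num_classes = 1024, 1000  # placeholder dimensions

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen vision tower producing pooled features."""
    def forward(self, images):  # images: (B, 3, H, W)
        # In practice: encoder(images).pooler_output with requires_grad_(False)
        return torch.randn(images.shape[0], feature_dim)

encoder = FrozenEncoder().eval()
probe = nn.Linear(feature_dim, num_classes)        # the only trainable part
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():                              # encoder stays frozen
    feats = encoder(images)
loss = criterion(probe(feats), labels)
loss.backward()
optimizer.step()
print(f"probe loss: {loss.item():.3f}")
```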

anxiangsir · Feb 12 '25 13:02

You can find more useful information here.

https://github.com/deepglint/unicom

anxiangsir · Feb 12 '25 13:02

Hello @anxiangsir,

This is very interesting. I'm currently studying LLaVA-Video-Qwen2-7B, which uses siglip-so400m-patch14-384 as its vision model. Can you share how to switch the vision tower to MLCD-ViT-B-32-224px?
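
For concreteness, here is a rough sketch of what I assume the change involves: loading the MLCD encoder in place of SigLIP through Hugging Face Transformers and feeding its patch features to the multimodal projector. The repo id and the assumption that the checkpoint is CLIP-compatible are my guesses, so please correct me if the actual loading path differs.

```python
# Rough sketch, not verified: load an MLCD checkpoint as a drop-in vision tower.
from transformers import CLIPVisionModel, CLIPImageProcessor
import torch

MLCD_REPO = "DeepGlint-AI/MLCD-ViT-B-32-224px"  # assumed Hugging Face repo id

# Assumes the checkpoint is published in CLIP-compatible format; the RoPE2D
# variants may need trust_remote_code=True or a custom tower class instead.
vision_tower = CLIPVisionModel.from_pretrained(MLCD_REPO)
image_processor = CLIPImageProcessor.from_pretrained(MLCD_REPO)
vision_tower.requires_grad_(False).eval()

# LLaVA-style towers usually feed per-patch hidden states (second-to-last layer,
# CLS token dropped) into the multimodal projector:
dummy = torch.randn(1, 3, 224, 224)
out = vision_tower(dummy, output_hidden_states=True)
patch_features = out.hidden_states[-2][:, 1:]   # (1, num_patches, hidden_dim)
print(patch_features.shape)
```

As far as I understand, in the LLaVA training scripts this corresponds to pointing the `--vision_tower` argument (and the resulting `mm_vision_tower` config field) at that repo id and retraining at least the projector, but I would appreciate confirmation.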

ixn3rd3mxn · Feb 13 '25 06:02

I have another question: have you ever used it with LLaVA-Video-Qwen2-7B? If so, what max_frames_num did you set? And when you ran LLaVA-Video-Qwen2-7B + MLCD-ViT-bigG-14-224px, how much GPU VRAM did it use, and which GPU did you use?
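
For context on why I'm asking: my rough understanding is that the visual token count (and hence the KV cache and activation memory) grows linearly with max_frames_num. The sketch below is only back-of-envelope arithmetic with illustrative numbers, not settings or memory figures reported by anyone in this thread.

```python
# Back-of-envelope only: how visual token count scales with max_frames_num.
# All numbers below are illustrative assumptions, not reported settings.

def visual_tokens(num_frames: int, image_size: int, patch_size: int, pool: int = 1) -> int:
    """Tokens fed to the LLM from the vision tower for a sampled clip."""
    patches_per_side = image_size // patch_size          # e.g. 336 // 14 = 24
    tokens_per_frame = (patches_per_side // pool) ** 2   # optional spatial pooling
    return num_frames * tokens_per_frame

# Example: a 336px tower with 14px patches, 2x2 pooling, 32 sampled frames.
frames = 32
tokens = visual_tokens(frames, image_size=336, patch_size=14, pool=2)
print(f"{frames} frames -> {tokens} visual tokens")      # 32 * 144 = 4608

# The KV cache scales linearly with total sequence length, so doubling
# max_frames_num roughly doubles the visual part of the context and its cache.
```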

ixn3rd3mxn · Feb 13 '25 07:02