
A New Powerful Visual Tower for LLaVA!

anxiangsir opened this issue 9 months ago · 3 comments

We adopted the official LLaVA-NeXT codebase and the official LLaVA-NeXT-Data training dataset to evaluate foundational vision models.
The language model is Qwen2.5-7B.

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|:---:|---:|---:|---:|---:|---:|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

The results of the ImageNet linear probe are as follows:

| Model Name | ImageNet Linear Probe | Hugging Face |
|---|---:|---|
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |
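
For context, an ImageNet linear probe freezes the vision encoder and trains only a single linear classifier on its pooled features; the table reports the top-1 accuracy of that classifier. Below is a minimal sketch of the idea, not the actual evaluation code; the feature dimension, the dummy encoder, and the single training step are placeholders.

```python
# Minimal linear-probe sketch (illustrative only): freeze the vision encoder,
# train a single linear layer on its pooled features.
import torch
import torch.nn as nn

feature_dim, num_classes = 1024, 1000  # placeholder dimensions

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen vision tower producing pooled features."""
    def forward(self, images):  # images: (B, 3, H, W)
        # In practice: encoder(images).pooler_output with requires_grad_(False)
        return torch.randn(images.shape[0], feature_dim)

encoder = FrozenEncoder().eval()
probe = nn.Linear(feature_dim, num_classes)        # the only trainable part
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():                              # encoder stays frozen
    feats = encoder(images)
loss = criterion(probe(feats), labels)
loss.backward()
optimizer.step()
print(f"probe loss: {loss.item():.3f}")
```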

anxiangsir · Feb 12 '25 13:02

You can find more useful information here.

https://github.com/deepglint/unicom

anxiangsir · Feb 12 '25 13:02

Hello @anxiangsir,

This is very interesting. I'm currently studying LLaVA-Video-Qwen2-7B, which uses siglip-so400m-patch14-384 as its vision model. Can you share how to switch the vision tower to MLCD-ViT-B-32-224px?
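
For concreteness, here is a rough sketch of what I assume the change involves: loading the MLCD encoder in place of SigLIP through Hugging Face Transformers and feeding its patch features to the multimodal projector. The repo id and the assumption that the checkpoint is CLIP-compatible are my guesses, so please correct me if the actual loading path differs.

```python
# Rough sketch, not verified: load an MLCD checkpoint as a drop-in vision tower.
from transformers import CLIPVisionModel, CLIPImageProcessor
import torch

MLCD_REPO = "DeepGlint-AI/MLCD-ViT-B-32-224px"  # assumed Hugging Face repo id

# Assumes the checkpoint is published in CLIP-compatible format; the RoPE2D
# variants may need trust_remote_code=True or a custom tower class instead.
vision_tower = CLIPVisionModel.from_pretrained(MLCD_REPO)
image_processor = CLIPImageProcessor.from_pretrained(MLCD_REPO)
vision_tower.requires_grad_(False).eval()

# LLaVA-style towers usually feed per-patch hidden states (second-to-last layer,
# CLS token dropped) into the multimodal projector:
dummy = torch.randn(1, 3, 224, 224)
out = vision_tower(dummy, output_hidden_states=True)
patch_features = out.hidden_states[-2][:, 1:]   # (1, num_patches, hidden_dim)
print(patch_features.shape)
```

As far as I understand, in the LLaVA training scripts this corresponds to pointing the `--vision_tower` argument (and the resulting `mm_vision_tower` config field) at that repo id and retraining at least the projector, but I would appreciate confirmation.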

ixn3rd3mxn · Feb 13 '25 06:02

I have another question: have you ever used it with LLaVA-Video-Qwen2-7B? If so, what max_frames_num did you set? And when you ran LLaVA-Video-Qwen2-7B + MLCD-ViT-bigG-14-224px, how much GPU VRAM did it use, and which GPU did you use?
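
For context on why I'm asking: my rough understanding is that the visual token count (and hence the KV cache and activation memory) grows linearly with max_frames_num. The sketch below is only back-of-envelope arithmetic with illustrative numbers, not settings or memory figures reported by anyone in this thread.

```python
# Back-of-envelope only: how visual token count scales with max_frames_num.
# All numbers below are illustrative assumptions, not reported settings.

def visual_tokens(num_frames: int, image_size: int, patch_size: int, pool: int = 1) -> int:
    """Tokens fed to the LLM from the vision tower for a sampled clip."""
    patches_per_side = image_size // patch_size          # e.g. 336 // 14 = 24
    tokens_per_frame = (patches_per_side // pool) ** 2   # optional spatial pooling
    return num_frames * tokens_per_frame

# Example: a 336px tower with 14px patches, 2x2 pooling, 32 sampled frames.
frames = 32
tokens = visual_tokens(frames, image_size=336, patch_size=14, pool=2)
print(f"{frames} frames -> {tokens} visual tokens")      # 32 * 144 = 4608

# The KV cache scales linearly with total sequence length, so doubling
# max_frames_num roughly doubles the visual part of the context and its cache.
```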

ixn3rd3mxn · Feb 13 '25 07:02