LLaVA-NeXT
A New Powerful Vision Tower for LLaVA!
We adopt the official LLaVA-NeXT framework and the official LLaVA-NeXT-Data training dataset to evaluate foundational vision models as the vision tower.
The language model is Qwen2.5-7B.
| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|---|---|---|---|---|---|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531 | 44.30 |
| MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572 | 46.78 |
| MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582 | 46.00 |
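In this setup, each candidate tower is the image encoder whose per-patch features are fed through the LLaVA-NeXT multimodal projector into the LLM. For orientation only, the sketch below shows how one of the towers from the table can be loaded and queried for those patch features with standard `transformers` classes; SigLIP is used because its Hugging Face ID and classes are well known, and this is an illustration, not the evaluation pipeline behind the numbers above. The MLCD checkpoints are loaded analogously via the deepglint/unicom code linked later in this post.

```python
# Illustrative sketch (not the evaluation pipeline used above): load a vision tower
# from the table and extract the per-patch features that a LLaVA-NeXT-style
# projector consumes. SigLIP is shown because its Hugging Face ID and classes are
# standard; the MLCD towers are loaded analogously via the deepglint/unicom code.
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

tower_id = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(tower_id)
tower = SiglipVisionModel.from_pretrained(tower_id).eval()

image = Image.open("example.jpg").convert("RGB")  # any RGB test image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    patch_features = tower(**inputs).last_hidden_state  # (1, num_patches, hidden_dim)

# These per-patch features are what the multimodal projector maps into the LLM's
# token-embedding space before being interleaved with the text tokens.
print(patch_features.shape)
```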
The results of the ImageNet linear probe are as follows:
| Model Name | ImageNet Linear Probe (top-1 %) | Hugging Face |
|---|---|---|
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |
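For context on this metric: a linear probe freezes the pretrained backbone and trains only a single linear classifier on its pooled features, so the accuracy reflects the quality of the frozen representation rather than any fine-tuning. Below is a minimal sketch of that protocol, assuming a backbone callable that returns pooled features and an already-prepared ImageNet dataloader; the hyperparameters are placeholders, not the recipe behind the numbers above.

```python
# Minimal linear-probe sketch: freeze the backbone and train only a single linear
# classifier on its pooled features; top-1 accuracy of that classifier on the
# ImageNet validation set is what gets reported. Backbone, dataloader and
# hyperparameters here are placeholders, not the authors' exact recipe.
import torch
import torch.nn as nn

def train_linear_probe(backbone, train_loader, feat_dim, num_classes=1000,
                       epochs=10, lr=1e-3, device="cuda"):
    backbone = backbone.to(device).eval()
    for p in backbone.parameters():           # freeze every backbone weight
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)       # assumed to return pooled (B, feat_dim) features
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head                                # evaluate top-1 accuracy on the val split
```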
You can find more useful information here: https://github.com/deepglint/unicom
hello @anxiangsir
This is very interesting! I'm currently studying LLaVA-Video-Qwen2-7B, which uses siglip-so400m-patch14-384 as its vision model. Can you share how to switch the vision model to MLCD-ViT-B-32-224px? (A general sketch of how such a swap usually works is included after these questions.)
I have another question: have you ever used it with LLaVA-Video-Qwen2-7B? If so, what max_frames_num did you set? And when running LLaVA-Video-Qwen2-7B + MLCD-ViT-bigG-14-224px, how much GPU VRAM does it use, and which GPU do you use?
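Not an official answer from the maintainers, but for general reference: in LLaVA-style codebases the vision tower is usually selected by a single checkpoint name (e.g. a `--vision_tower` training argument or an `mm_vision_tower` config field), and swapping it mostly means pointing that value at the new checkpoint and retraining the multimodal projector, because the feature dimension and patch grid change. The sketch below covers only the loading side; the repo ID used is an assumed placeholder, so check the deepglint/unicom repository for the actual checkpoint names and loading code.

```python
# Rough sketch of swapping the vision tower: load the replacement encoder and its
# image processor, then point the multimodal config at it. The repo ID below is an
# assumed placeholder, not a confirmed Hugging Face path.
from transformers import AutoConfig, AutoImageProcessor, AutoModel

new_tower_id = "DeepGlint-AI/MLCD-ViT-B-32-224px"  # placeholder / assumed repo name

tower_cfg = AutoConfig.from_pretrained(new_tower_id, trust_remote_code=True)
vision_tower = AutoModel.from_pretrained(new_tower_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(new_tower_id, trust_remote_code=True)

# In LLaVA-style training/config files the tower is referenced by name, e.g. a
# `--vision_tower <checkpoint>` argument or an `mm_vision_tower` config field;
# replace that value with `new_tower_id`.
#
# Because MLCD-ViT-B-32-224px has a different hidden size and patch grid than
# siglip-so400m-patch14-384, the mm projector must be re-initialized and at least
# the projector-pretraining stage rerun before the model produces sensible outputs.
print("replacement tower hidden size:", getattr(tower_cfg, "hidden_size", None))
```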