[New Model]: LLaVA-OneVision

Open · ethanporcaro opened this issue on Aug 12, 2024

The model to consider.

https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov

There are a bunch of others using the same architecture.

The closest model vllm already supports.

qwen2. AFAIK the main difference is the vision encoder, which I think is based on SigLIP (also supported).

What's the difficulty of supporting the model you want?

Combining qwen2 and SigLIP (possibly with other changes); see the config sketch after this comment.

ethanporcaro · Aug 12 '24 15:08
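A quick way to confirm the composition described above is to inspect the config of a converted Hugging Face checkpoint. The snippet below is a minimal sketch only; the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint name and the expected `model_type` strings are assumptions, not taken from this thread.

```python
# Minimal sketch: confirm that LLaVA-OneVision pairs a Qwen2 language model
# with a SigLIP vision encoder by inspecting the Hugging Face config.
# The checkpoint name and expected model_type values are assumptions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

print(cfg.model_type)                # expected: "llava_onevision"
print(cfg.text_config.model_type)    # expected: "qwen2"
print(cfg.vision_config.model_type)  # expected: "siglip_vision_model"
```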

Once we merge the PR to support multi-image/video input, it should be pretty straightforward to add support for this model in vLLM!

ywang96 · Aug 12 '24 16:08

Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.

DarkLight1337 · Sep 12 '24 04:09

> Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.

I have implemented llava-ov support. Once the benchmark evaluation is done, I will make a PR for this.

litianjian · Sep 12 '24 12:09
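Once the model is hooked up, passing a video through vLLM should look roughly like the existing multimodal path. The sketch below assumes LLaVA-OneVision support has landed; the checkpoint name and chat prompt template are assumptions and may not match the final implementation.

```python
# Sketch of video input via vLLM's multi_modal_data API, assuming
# LLaVA-OneVision support is available. Checkpoint name and prompt
# template are assumptions.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-onevision-qwen2-7b-ov-hf")

# Dummy clip: 8 frames of 384x384 RGB. In practice, decode real video frames.
video = np.zeros((8, 384, 384, 3), dtype=np.uint8)

prompt = (
    "<|im_start|>user <video>\nDescribe this clip.<|im_end|>"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```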

I've tried this model with BitsAndBytes 4-bit quantization, and it looks like it is still not supported the way it is in Hugging Face Transformers. Do you also plan to add support for quantizing this model?

salvaba94 · Oct 03 '24 15:10
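For reference, here is a hedged sketch of the Transformers-side 4-bit load that the comment above reports as working; the checkpoint name and model class are assumptions based on the llava-hf conversion, and this says nothing about vLLM's BitsAndBytes support for this model.

```python
# Sketch: BitsAndBytes 4-bit loading through Hugging Face Transformers,
# which the comment above reports as working. Checkpoint name and model
# class are assumptions; requires bitsandbytes and accelerate installed.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaOnevisionForConditionalGeneration,
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf"
)
```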

Are there any benchmarks for serving this multimodal model?

zihaolucky · Nov 30 '24 06:11

Our benchmark scripts support multimodal datasets. See https://github.com/vllm-project/vllm/pull/8495 for some examples.

DarkLight1337 · Dec 03 '24 05:12
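As a starting point, an invocation against a running OpenAI-compatible server might look like the following. The flag names and the dataset choice are assumptions based on the benchmark script around the time of that PR; check `python benchmarks/benchmark_serving.py --help` for the actual options.

```bash
# Sketch only: flag names and dataset are assumptions; verify with --help.
# 1) Serve the model with an OpenAI-compatible endpoint.
vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf

# 2) Benchmark it with a multimodal dataset sampled from the Hugging Face Hub.
python benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model llava-hf/llava-onevision-qwen2-7b-ov-hf \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --num-prompts 100
```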