[New Model]: LLaVA-OneVision

Open · ethanporcaro opened this issue on Aug 12, 2024

The model to consider.

https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov

There are a bunch of others using the same architecture.

The closest model vllm already supports.

qwen2. AFAIK the main difference is the vision encoder, which I think is based on SigLIP (also supported).

What's the difficulty of supporting the model you want?

Combining qwen2 and SigLIP (possibly with other changes); see the config sketch after this comment.

ethanporcaro · Aug 12 '24 15:08
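A quick way to confirm the composition described above is to inspect the config of a converted Hugging Face checkpoint. The snippet below is a minimal sketch only; the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint name and the expected `model_type` strings are assumptions, not taken from this thread.

```python
# Minimal sketch: confirm that LLaVA-OneVision pairs a Qwen2 language model
# with a SigLIP vision encoder by inspecting the Hugging Face config.
# The checkpoint name and expected model_type values are assumptions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

print(cfg.model_type)                # expected: "llava_onevision"
print(cfg.text_config.model_type)    # expected: "qwen2"
print(cfg.vision_config.model_type)  # expected: "siglip_vision_model"
```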

Once we merge the PR to support multi-image/video input, it should be pretty straightforward to add support for this model in vLLM!

ywang96 · Aug 12 '24 16:08

Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.

DarkLight1337 · Sep 12 '24 04:09

> Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.

I have implemented llava-ov support. Once the benchmark evaluation is done, I will make a PR for this.

litianjian · Sep 12 '24 12:09
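Once the model is hooked up, passing a video through vLLM should look roughly like the existing multimodal path. The sketch below assumes LLaVA-OneVision support has landed; the checkpoint name and chat prompt template are assumptions and may not match the final implementation.

```python
# Sketch of video input via vLLM's multi_modal_data API, assuming
# LLaVA-OneVision support is available. Checkpoint name and prompt
# template are assumptions.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-onevision-qwen2-7b-ov-hf")

# Dummy clip: 8 frames of 384x384 RGB. In practice, decode real video frames.
video = np.zeros((8, 384, 384, 3), dtype=np.uint8)

prompt = (
    "<|im_start|>user <video>\nDescribe this clip.<|im_end|>"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```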

I've tried this model with BitsAndBytes 4-bit quantization, and it looks like it is still not supported the way it is in Hugging Face Transformers. Do you also plan to add support for quantizing this model?

salvaba94 · Oct 03 '24 15:10
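For reference, here is a hedged sketch of the Transformers-side 4-bit load that the comment above reports as working; the checkpoint name and model class are assumptions based on the llava-hf conversion, and this says nothing about vLLM's BitsAndBytes support for this model.

```python
# Sketch: BitsAndBytes 4-bit loading through Hugging Face Transformers,
# which the comment above reports as working. Checkpoint name and model
# class are assumptions; requires bitsandbytes and accelerate installed.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaOnevisionForConditionalGeneration,
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-onevision-qwen2-7b-ov-hf"
)
```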

Are there any benchmarks for serving this multimodal model?

zihaolucky · Nov 30 '24 06:11

Our benchmark scripts support multimodal datasets. See https://github.com/vllm-project/vllm/pull/8495 for some examples.

DarkLight1337 · Dec 03 '24 05:12
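As a starting point, an invocation against a running OpenAI-compatible server might look like the following. The flag names and the dataset choice are assumptions based on the benchmark script around the time of that PR; check `python benchmarks/benchmark_serving.py --help` for the actual options.

```bash
# Sketch only: flag names and dataset are assumptions; verify with --help.
# 1) Serve the model with an OpenAI-compatible endpoint.
vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf

# 2) Benchmark it with a multimodal dataset sampled from the Hugging Face Hub.
python benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model llava-hf/llava-onevision-qwen2-7b-ov-hf \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --num-prompts 100
```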