Roger Wang
Once we merge the PR to support multi-image/video input, it should be pretty straightforward to add support for this model in vLLM!
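(For context, once multi-image input lands, offline usage would presumably look something like the sketch below; the model name, prompt format, and the `multi_modal_data`/`image` field names are assumptions here, not the merged API.)

```python
# Hypothetical sketch of passing multiple images to vLLM's offline API.
# Field names and the prompt format are assumptions, not the final merged API.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # placeholder multi-modal model

images = [Image.open("frame_0.png"), Image.open("frame_1.png")]
outputs = llm.generate(
    {
        "prompt": "USER: <image><image>\nDescribe the two frames. ASSISTANT:",
        "multi_modal_data": {"image": images},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```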
Thank you for the PR! @huseinzol05 Currently our infrastructure support for encoder-decoder models is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a...
@wangchen615 Please correct me if I misunderstood, but is this for testing the case where you have another layer on top of the model deployment with concurrency control?
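(To illustrate the scenario being asked about: a minimal sketch of an external concurrency-control layer sitting between a benchmark client and the deployment, assuming an OpenAI-compatible vLLM server at `localhost:8000`; the in-flight cap of 8 and the model name are arbitrary placeholders.)

```python
# Minimal sketch of a client-side concurrency-control layer on top of a
# deployed (OpenAI-compatible) vLLM server. The endpoint, model name, and
# the in-flight cap of 8 are assumptions for illustration only.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
in_flight = asyncio.Semaphore(8)  # at most 8 requests reach the server at once

async def send(prompt: str) -> str:
    async with in_flight:  # extra requests queue here, not at the server
        resp = await client.completions.create(
            model="mistralai/Mistral-7B-Instruct-v0.1",
            prompt=prompt,
            max_tokens=64,
        )
        return resp.choices[0].text

async def main() -> None:
    prompts = [f"Request {i}: say hi" for i in range(100)]
    results = await asyncio.gather(*(send(p) for p in prompts))
    print(f"received {len(results)} completions")

asyncio.run(main())
```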
I recently made https://github.com/vllm-project/vllm/pull/3194 to add a prefix caching benchmark - @wangchen615 let me know if you want me to include changes to resolve this issue in that PR as well!
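(Not the script from that PR, but a rough sketch of the kind of measurement it targets: timing a batch of prompts that share a long common prefix with prefix caching off and then on. The model and prompt sizes are placeholders.)

```python
# Rough sketch of measuring the effect of automatic prefix caching on prompts
# that share a long common prefix. Model and sizes are placeholders; in
# practice the two runs would live in separate processes to free GPU memory.
import time
from vllm import LLM, SamplingParams

shared_prefix = "You are a helpful assistant. " * 50  # long shared prefix
prompts = [shared_prefix + f"Question {i}: what is {i} + {i}?" for i in range(32)]
params = SamplingParams(temperature=0, max_tokens=32)

for enable in (False, True):
    llm = LLM(model="facebook/opt-125m", enable_prefix_caching=enable)
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"enable_prefix_caching={enable}: {time.perf_counter() - start:.2f}s")
```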
> > @fyabc are you interested in implementing this? > > Hi, our team is developing Qwen2-Audio vLLM support, please check [this branch](https://github.com/faychu/vllm/tree/qwen2-audio?rgh-link-date=2024-09-12T04%3A46%3A45Z), and @faychu will take effort on...
> > @imkero @wulipc do you have any LoRA-tuned models that can be used? cc @ywang96 > > Sorry I don't have one currently. It would be great if we...
@njhill Wow! 30% is quite a bit (though serving llama-7b over 2 A100-80G probably doesn't make much sense in practice). I will do some testing on this branch on parallel...
@njhill I did some preliminary testing on H100 TP2 with `mistralai/Mistral-7B-Instruct-v0.1` and there's definitely some speedup (not as much as 30% since this is running on ShareGPT). Server launch command:...
> For llama-2-70b with a single request of 5 input / 1000 output tokens, the times I got are 32.3 before and 30.8 after, i.e. a 4-5% speedup. I will test on A100-80G with Mixtral...
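(For reference, going from 32.3 s to 30.8 s works out to (32.3 - 30.8) / 32.3 ≈ 4.6%, consistent with the quoted 4-5% figure.)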
Results for Mixtral on A100-80G. For each configuration I ran 3 times and took the best result (usually the first run is bad because of warmup). TP4 ``` With...