Congcong Chen
Hi @LucasWilkinson, with this PR merged I can no longer build vLLM from source; it fails with the following errors. Could you please help look into this? The...
@mgoin, PR https://github.com/vllm-project/vllm/pull/7730 is unfortunately not working for me.
Sorry, the temporary log was deleted automatically. It also looks like the error is different after I apply PR https://github.com/vllm-project/vllm/pull/7730. See ``` ets->outlines=0.0.43->vllm==0.5.4+cu118) (2024.1) Requirement already satisfied: six>=1.5 in /home/aiscuser/.local/lib/python3.10/site-packages (from...
I'm not sure why this PR triggers a build of MacheteKernel for me locally; it looks like a bug. Can we revert this PR, since it affects other users? cc @simon-mo
@LucasWilkinson, https://github.com/vllm-project/vllm/pull/7757 doesn't work either. With the patch the build now succeeds, but running the vLLM server fails; see the error below: ``` (myenv) aiscuser@node-0:~/vllm/benchmarks$ python...
Feel free to check out the PR description [here](https://github.com/vllm-project/vllm/pull/14119) for steps on: 1. Starting the server with the base model and vision/speech LoRA weights. 2. Sending requests to the OpenAI-compatible...
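For reference, here is a minimal sketch of those two steps. The flags, adapter names (`vision`, `speech`), and weight paths below are illustrative assumptions, not the exact commands from the PR description, which remains the authoritative guide.

```python
# Step 1 (run in a shell): start the OpenAI-compatible server with the base model
# and the vision/speech LoRA adapters. Adapter names and paths are placeholders.
#
#   vllm serve microsoft/Phi-4-multimodal-instruct \
#       --trust-remote-code \
#       --enable-lora \
#       --lora-modules vision=/path/to/vision_lora speech=/path/to/speech_lora

# Step 2: send a request to the server, selecting a LoRA adapter by the name
# registered with --lora-modules via the `model` field of the OpenAI API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="speech",  # LoRA adapter name, not the base model name
    messages=[{"role": "user", "content": "Describe what this adapter is for."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```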
We also found that the LoRA (Punica) kernels are quite slow for [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct). Here’s the fix: [PR #14272](https://github.com/vllm-project/vllm/pull/14272). With this fix, we observed up to a 5x improvement in generation speed.
The vLLM version that works: v0.5.2
Looking into this bug, I found that chunked prefill is not correctly supported by the block-sparse attention module used by the Phi-3-small-128k-instruct model, and chunked prefill is turned on by...
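Until the block-sparse attention path handles chunked prefill correctly, one possible workaround is to disable chunked prefill explicitly. A minimal sketch, assuming the offline `LLM` entry point and the `enable_chunked_prefill` engine argument; this is a workaround suggestion, not the fix discussed above.

```python
# Minimal sketch: run Phi-3-small-128k-instruct with chunked prefill disabled,
# avoiding the incompatibility with its block-sparse attention module.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-128k-instruct",
    trust_remote_code=True,        # Phi-3-small ships custom modeling code
    enable_chunked_prefill=False,  # work around the chunked-prefill issue
)

outputs = llm.generate(
    ["Summarize the benefits of block-sparse attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```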
> Thanks for your contribution, have you tested the performance of autotune on models like llama?

Nope, I am not familiar with the llama family of models that use LoRA....