Congcong Chen

Results: 10 comments by Congcong Chen

Hi @LucasWilkinson, with this PR merged I can no longer build vLLM from source; it fails with the following errors. Could you please help look into this? The...

@mgoin, unfortunately PR https://github.com/vllm-project/vllm/pull/7730 is not working for me.

Sorry, the temporary log was deleted automatically. It also looks like that after applying PR https://github.com/vllm-project/vllm/pull/7730, the error is different. See ``` ets->outlines=0.0.43->vllm==0.5.4+cu118) (2024.1) Requirement already satisfied: six>=1.5 in /home/aiscuser/.local/lib/python3.10/site-packages (from...

I'm not sure why this PR triggers a build of MacheteKernel for me locally; it looks like a bug. Can we revert this PR, since it affects other users? cc @simon-mo

@LucasWilkinson, https://github.com/vllm-project/vllm/pull/7757 doesn't work either. With the patch the build now succeeds, but running the vLLM server fails; see the error below: ``` (myenv) aiscuser@node-0:~/vllm/benchmarks$ python...

Feel free to check out the PR description [here](https://github.com/vllm-project/vllm/pull/14119) for steps on:

1. Starting the server with the base model and vision/speech LoRA weights.
2. Sending requests to the OpenAI-compatible...
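
For reference, here is a minimal sketch of the request step, assuming the server from the PR description is already running on `localhost:8000` and a LoRA adapter was registered under a hypothetical name such as `speech-lora` (adjust host, port, and adapter name to your setup):

```python
# Minimal sketch: send a chat request to a running vLLM OpenAI-compatible server.
# Assumes the server was started per PR #14119 with a LoRA module registered
# under the (hypothetical) name "speech-lora".
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        # Select the LoRA adapter by the name it was registered with;
        # use the base model name here to skip the adapter.
        "model": "speech-lora",
        "messages": [{"role": "user", "content": "Describe what you hear."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```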

We also found that LoRA (Punica) kernels are quite slow for [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct). Here’s the fix: [PR #14272](https://github.com/vllm-project/vllm/pull/14272). With this fix, we observed up to a 5x improvement in generation speed.
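
For anyone who wants to reproduce the comparison, here is a rough sketch of how generation speed with a LoRA adapter could be timed offline; the adapter name and path are placeholders, and the exact `LoRARequest` arguments may differ across vLLM versions:

```python
# Rough sketch for timing LoRA generation speed offline (adapter name/path are placeholders).
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="microsoft/Phi-4-multimodal-instruct",
    enable_lora=True,        # exercise the LoRA (Punica) kernel path
    trust_remote_code=True,
)
params = SamplingParams(max_tokens=128)
lora = LoRARequest("speech-lora", 1, "/path/to/speech-lora")  # hypothetical adapter

start = time.perf_counter()
outputs = llm.generate(["Transcribe the following audio."] * 8, params, lora_request=lora)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```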

Looking into this bug, I found that chunked prefill is not correctly supported by the block-sparse attention module used by the Phi-3-small-128k-instruct model, and chunked prefill is turned on by...
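
As a temporary workaround while the block-sparse path is being fixed, chunked prefill can be turned off explicitly. A sketch, assuming the offline `LLM` entry point and the `enable_chunked_prefill` engine argument:

```python
# Sketch of a workaround: explicitly disable chunked prefill for
# Phi-3-small-128k-instruct until its block-sparse attention module supports it.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-128k-instruct",
    trust_remote_code=True,        # Phi-3-small ships custom modeling code
    enable_chunked_prefill=False,  # avoid the unsupported chunked-prefill path
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```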

> Thanks for your contribution, have you tested the performance of autotune on models like llama?

Nope, I am not familiar with the Llama family of models that use LoRA....