Woosuk Kwon
@skrider I just edited this PR: 1) I removed the dependency on your FlashAttention repo (let's add it in the next PR), 2) I enabled prefix attention, and 3) I moved...
Status update: 1. I created the `vllm-flash-attn` package on PyPI based on @skrider's small-page version of `flash-attn` (Repo: https://github.com/vllm-project/flash-attention). The package was built with PyTorch 2.1.2 and CUDA 12.1. I cut...
@simon-mo It's not ready because it doesn't produce correct results when prefix attention is used. Also, this PR needs more tests. It'll be hard to include this in v0.4.0...
Related PRs: #2744 and #3010
@rkooo567 Thanks for letting me know that the wheel doesn't work on Ubuntu 20.04. Let me fix this before the merge. The wheel is already built with torch 2.3, btw. >...
Hi @oximi123, unfortunately, vLLM does not officially support Windows at the moment (though some users have succeeded in running it on Windows). Could you please use WSL and see whether the...
@lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers....
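For what it's worth, here is a minimal `torch.distributed` sketch of that separation, just to illustrate the idea (this is not vLLM's actual scheduler/worker code; the message fields, tensor shapes, and launch setup below are made up for the example): tensor payloads go through NCCL collectives, while small Python-object control messages are broadcast separately.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank: int, world_size: int):
    # NCCL backend handles cross-GPU tensor communication.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # 1) Control message: a small Python dict describing the work to do.
    #    broadcast_object_list serializes it and broadcasts it from rank 0.
    control = [{"num_seqs": 8, "hidden_size": 64}] if rank == 0 else [None]
    dist.broadcast_object_list(control, src=0)

    # 2) Tensor payload: the actual data is moved with an NCCL collective.
    hidden = torch.randn(
        control[0]["num_seqs"], control[0]["hidden_size"], device="cuda"
    )
    dist.all_reduce(hidden)

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```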
Thanks for the advice @lanking520! We will take that into account. Currently, we are focusing on fixing bugs & adding requested models. After these are addressed, we will look into...
Hi @liulfy, it's because we allocate 4 GB of CPU memory per GPU. Adding `swap_space=1` when initializing `LLM` will solve the problem.
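For reference, a minimal snippet showing where the argument goes (the model name here is just an example; `swap_space` is specified in GiB per GPU):

```python
from vllm import LLM

# swap_space controls the CPU swap space reserved per GPU, in GiB.
# Lowering it from the default to 1 shrinks the host-memory allocation.
llm = LLM(model="facebook/opt-125m", swap_space=1)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```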
Unfortunately, vLLM does not support Windows at the moment.