Woosuk Kwon
@skrider I just edited this PR: 1) I removed the dependency on your FlashAttention repo (let's add it in the next PR), 2) I enabled prefix attention, and 3) I moved...
Status update: 1. I created the `vllm-flash-attn` package on PyPI based on @skrider's small-page version of `flash-attn` (Repo: https://github.com/vllm-project/flash-attention). The package was built with PyTorch 2.1.2 and CUDA 12.1. I cut...
@simon-mo It's not ready because it doesn't produce correct results when prefix attention is used. Also, this PR needs more tests. It'll be hard to include this in v0.4.0...
Related PRs: #2744 and #3010
@rkooo567 Thanks for letting me know that the wheel doesn't work on Ubuntu 20.04. Let me fix this before the merge. The wheel is already built with torch 2.3, btw. >...
Hi @oximi123, unfortunately, vLLM does not officially support Windows at the moment (though some users have succeeded in running it on Windows). Could you please use WSL and see whether the...
@lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers....
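For what it's worth, here is a minimal `torch.distributed` sketch of that separation, just to illustrate the idea (this is not vLLM's actual scheduler/worker code; the message fields, tensor shapes, and launch setup below are made up for the example): tensor payloads go through NCCL collectives, while small Python-object control messages are broadcast separately.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank: int, world_size: int):
    # NCCL backend handles cross-GPU tensor communication.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # 1) Control message: a small Python dict describing the work to do.
    #    broadcast_object_list serializes it and broadcasts it from rank 0.
    control = [{"num_seqs": 8, "hidden_size": 64}] if rank == 0 else [None]
    dist.broadcast_object_list(control, src=0)

    # 2) Tensor payload: the actual data is moved with an NCCL collective.
    hidden = torch.randn(
        control[0]["num_seqs"], control[0]["hidden_size"], device="cuda"
    )
    dist.all_reduce(hidden)

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```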
Thanks for the advice @lanking520! We will take that into account. Currently, we are focusing on fixing bugs & adding requested models. After these are addressed, we will look into...
Hi @liulfy, it's because we allocate 4 GB of CPU memory per GPU. Adding `swap_space=1` when initializing `LLM` will solve the problem.
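For reference, a minimal snippet showing where the argument goes (the model name here is just an example; `swap_space` is specified in GiB per GPU):

```python
from vllm import LLM

# swap_space controls the CPU swap space reserved per GPU, in GiB.
# Lowering it from the default to 1 shrinks the host-memory allocation.
llm = LLM(model="facebook/opt-125m", swap_space=1)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```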
Unfortunately, vLLM does not support Windows at the moment.