Woosuk Kwon
Hi @jibowang, thanks for raising the issue and apologies for the late response. Unfortunately, head size 100 is not supported by [xformers](https://github.com/facebookresearch/xformers). The library requires the head size to be...
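For illustration only, here is a minimal sketch of the kind of head-size guard involved; the helper name and the list of supported sizes are assumptions, not the actual xformers or vLLM constants:

```python
# Hypothetical guard mirroring the constraint described above.
# The list of supported sizes is an assumption for illustration only.
_ASSUMED_SUPPORTED_HEAD_SIZES = [64, 80, 96, 112, 128, 256]

def check_head_size(head_size: int) -> None:
    """Fail early if the attention kernel cannot handle this head size."""
    if head_size not in _ASSUMED_SUPPORTED_HEAD_SIZES:
        raise ValueError(
            f"Head size {head_size} is not supported. "
            f"Supported head sizes: {_ASSUMED_SUPPORTED_HEAD_SIZES}"
        )

check_head_size(100)  # raises ValueError, matching the report above
```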
Hi @Kawai1Ace, it seems you are using the latest main branch of vLLM. Did you [install vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source)? The `vllm._C` module is built when you install vLLM.
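As a quick sanity check (a sketch, not an official diagnostic), you can verify that the compiled extension was produced by the build:

```python
# If this import fails, the C++/CUDA extension was not built,
# which usually means vLLM was not installed from source correctly.
import importlib

try:
    importlib.import_module("vllm._C")
    print("vllm._C is available.")
except ImportError as exc:
    print(f"vllm._C is missing: {exc}. Try rebuilding vLLM from source.")
```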
Why don't we add another API for this instead of expanding `LLM`?
@hanzhi713 This is awesome! Many thanks for the PR! A quick question: do you happen to know about the custom all-reduce kernels in [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu)? Is this PR related to the...
@hanzhi713 BTW I got this error when using 2 L4 GPUs:
```
(RayWorkerVllm pid=51757) INFO 12-26 04:18:45 fast_allreduce.py:21] NVLink detection failed with message "Not Supported". This is normal if your...
```
@hanzhi713 I still got the same error on 2 L4 GPUs:
```
(RayWorkerVllm pid=70031) INFO 12-27 08:52:31 fast_allreduce.py:70] NVLink detection failed with message "Not Supported". This is normal if your...
```
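For context on the message above, here is a small sketch of how NVLink detection typically surfaces "Not Supported" on GPUs without NVLink, such as the L4. It assumes the `pynvml` package and is not the code vLLM itself uses:

```python
# Probe the NVLink state of each GPU; GPUs without NVLink (e.g. L4)
# raise "Not Supported", matching the log message above.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            state = pynvml.nvmlDeviceGetNvLinkState(handle, 0)  # link 0
            print(f"GPU {i}: NVLink link 0 state = {state}")
        except pynvml.NVMLError as err:
            print(f"GPU {i}: NVLink detection failed: {err}")
finally:
    pynvml.nvmlShutdown()
```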
@hanzhi713 Apologies for the delay. I was dealing with some personal issues over the last couple of weeks. I will review the PR today. BTW, one small concern on my side is...
@jpvillam-amd Can we have `ROCmAttentionBackend` that selects one of the four different implementations of prefill Attention (i.e., Triton, CK, xFormers, naive) while using `PagedAttention` for decoding? I feel like mixing...
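Roughly what I have in mind, as a sketch with hypothetical method names rather than the actual vLLM interfaces:

```python
# Hypothetical sketch: one ROCm backend that selects a prefill implementation
# at construction time but always uses PagedAttention for decoding.
from enum import Enum


class PrefillImpl(Enum):
    TRITON = "triton"
    CK = "ck"
    XFORMERS = "xformers"
    NAIVE = "naive"


class ROCmAttentionBackend:
    def __init__(self, prefill_impl: PrefillImpl):
        # Dispatch table over the four prefill implementations.
        self._prefill_fns = {
            PrefillImpl.TRITON: self._triton_prefill,
            PrefillImpl.CK: self._ck_prefill,
            PrefillImpl.XFORMERS: self._xformers_prefill,
            PrefillImpl.NAIVE: self._naive_prefill,
        }
        self._prefill = self._prefill_fns[prefill_impl]

    def forward_prefill(self, query, key, value):
        # Prefill goes through whichever implementation was selected.
        return self._prefill(query, key, value)

    def forward_decode(self, query, key_cache, value_cache, block_tables):
        # Decoding always goes through the paged-attention kernel.
        return self._paged_attention_decode(query, key_cache, value_cache, block_tables)

    # Placeholders standing in for the real kernels.
    def _triton_prefill(self, q, k, v): ...
    def _ck_prefill(self, q, k, v): ...
    def _xformers_prefill(self, q, k, v): ...
    def _naive_prefill(self, q, k, v): ...
    def _paged_attention_decode(self, q, k_cache, v_cache, block_tables): ...
```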
@jpvillam-amd Thanks for updating the PR. Do you mind if I directly edit this PR? I refactored the PR a bit and wanted to directly upstream it to this PR...
@skrider Thanks for the great work! Can I directly fix this PR for faster integration?