Woosuk Kwon
Hi @jibowang, thanks for raising the issue and apologies for the late response. Unfortunately, head size 100 is not supported by [xformers](https://github.com/facebookresearch/xformers). The library requires the head size to be...
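For illustration only, here is a minimal sketch of the kind of head-size guard involved; the helper name and the list of supported sizes are assumptions, not the actual xformers or vLLM constants:

```python
# Hypothetical guard mirroring the constraint described above.
# The list of supported sizes is an assumption for illustration only.
_ASSUMED_SUPPORTED_HEAD_SIZES = [64, 80, 96, 112, 128, 256]

def check_head_size(head_size: int) -> None:
    """Fail early if the attention kernel cannot handle this head size."""
    if head_size not in _ASSUMED_SUPPORTED_HEAD_SIZES:
        raise ValueError(
            f"Head size {head_size} is not supported. "
            f"Supported head sizes: {_ASSUMED_SUPPORTED_HEAD_SIZES}"
        )

check_head_size(100)  # raises ValueError, matching the report above
```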
Hi @Kawai1Ace, it seems you are using the latest main branch of vLLM. Did you [install vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source)? The `vllm._C` module is built when you install vLLM.
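As a quick sanity check (a sketch, not an official diagnostic), you can verify that the compiled extension was produced by the build:

```python
# If this import fails, the C++/CUDA extension was not built,
# which usually means vLLM was not installed from source correctly.
import importlib

try:
    importlib.import_module("vllm._C")
    print("vllm._C is available.")
except ImportError as exc:
    print(f"vllm._C is missing: {exc}. Try rebuilding vLLM from source.")
```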
Why don't we add another API for this instead of expanding `LLM`?
@hanzhi713 This is awesome! Many thanks for the PR! A quick question: do you happen to know about the custom all-reduce kernels in [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu)? Is this PR related to the...
@hanzhi713 BTW I got this error when using 2 L4 GPUs:
```
(RayWorkerVllm pid=51757) INFO 12-26 04:18:45 fast_allreduce.py:21] NVLink detection failed with message "Not Supported". This is normal if your...
```
@hanzhi713 I still got the same error on 2 L4 GPUs:
```
(RayWorkerVllm pid=70031) INFO 12-27 08:52:31 fast_allreduce.py:70] NVLink detection failed with message "Not Supported". This is normal if your...
```
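For context on the message above, here is a small sketch of how NVLink detection typically surfaces "Not Supported" on GPUs without NVLink, such as the L4. It assumes the `pynvml` package and is not the code vLLM itself uses:

```python
# Probe the NVLink state of each GPU; GPUs without NVLink (e.g. L4)
# raise "Not Supported", matching the log message above.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            state = pynvml.nvmlDeviceGetNvLinkState(handle, 0)  # link 0
            print(f"GPU {i}: NVLink link 0 state = {state}")
        except pynvml.NVMLError as err:
            print(f"GPU {i}: NVLink detection failed: {err}")
finally:
    pynvml.nvmlShutdown()
```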
@hanzhi713 Apologies for the delay. I was dealing with some personal issues over the last couple of weeks. I will review the PR today. BTW, one small concern on my side is...
@jpvillam-amd Can we have `ROCmAttentionBackend` that selects one of the four different implementations of prefill Attention (i.e., Triton, CK, xFormers, naive) while using `PagedAttention` for decoding? I feel like mixing...
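Roughly what I have in mind, as a sketch with hypothetical method names rather than the actual vLLM interfaces:

```python
# Hypothetical sketch: one ROCm backend that selects a prefill implementation
# at construction time but always uses PagedAttention for decoding.
from enum import Enum


class PrefillImpl(Enum):
    TRITON = "triton"
    CK = "ck"
    XFORMERS = "xformers"
    NAIVE = "naive"


class ROCmAttentionBackend:
    def __init__(self, prefill_impl: PrefillImpl):
        # Dispatch table over the four prefill implementations.
        self._prefill_fns = {
            PrefillImpl.TRITON: self._triton_prefill,
            PrefillImpl.CK: self._ck_prefill,
            PrefillImpl.XFORMERS: self._xformers_prefill,
            PrefillImpl.NAIVE: self._naive_prefill,
        }
        self._prefill = self._prefill_fns[prefill_impl]

    def forward_prefill(self, query, key, value):
        # Prefill goes through whichever implementation was selected.
        return self._prefill(query, key, value)

    def forward_decode(self, query, key_cache, value_cache, block_tables):
        # Decoding always goes through the paged-attention kernel.
        return self._paged_attention_decode(query, key_cache, value_cache, block_tables)

    # Placeholders standing in for the real kernels.
    def _triton_prefill(self, q, k, v): ...
    def _ck_prefill(self, q, k, v): ...
    def _xformers_prefill(self, q, k, v): ...
    def _naive_prefill(self, q, k, v): ...
    def _paged_attention_decode(self, q, k_cache, v_cache, block_tables): ...
```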
@jpvillam-amd Thanks for updating the PR. Do you mind if I directly edit this PR? I refactored the PR a bit and wanted to directly upstream it to this PR...
@skrider Thanks for the great work! Can I directly fix this PR for faster integration?