[ROCm] Add support for Punica kernels on AMD GPUs
This PR adds ROCm support for the Punica kernels to enable multi-LoRA serving on AMD GPUs. Some Punica files are slightly refactored so that the correct C++/hipcc compilers are invoked when building under ROCm. A custom bgmv shrink kernel is added to account for the difference in warp size between AMD's GPUs (64-lane wavefronts) and Nvidia's (32-lane warps). The port has been tested on an MI210, and the unit tests applying LoRA pass.
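To illustrate the kind of change involved, here is a minimal sketch of warp-size-agnostic kernel code. This is not the PR's actual kernel; `kWarpSize` and `warp_reduce_sum` are hypothetical names introduced for this example, and the sketch assumes a hipcc/nvcc build where `__HIP_PLATFORM_AMD__` distinguishes the two targets:

```cpp
#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  constexpr int kWarpSize = 64;  // AMD CDNA GPUs such as the MI210 use 64-lane wavefronts
#else
  #include <cuda_runtime.h>
  constexpr int kWarpSize = 32;  // Nvidia GPUs use 32-lane warps
#endif

// Hypothetical helper: sum `val` across all lanes of a warp/wavefront.
// A kernel written against a hard-coded width of 32 would silently use
// only half the lanes on AMD hardware, which is why the reduction loop
// is parameterized on kWarpSize.
__device__ float warp_reduce_sum(float val) {
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
#if defined(__HIP_PLATFORM_AMD__)
    val += __shfl_down(val, offset);                    // HIP shuffle (no sync mask)
#else
    val += __shfl_down_sync(0xffffffffu, val, offset);  // CUDA 9+ sync shuffle
#endif
  }
  return val;
}
```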
@hongxiayang @lcskrishna Could you help review this PR?
@hongxiayang @dllehr-amd Could you review this PR? It is an important PR that enables multi-LoRA serving on AMD GPUs, a key vLLM feature that many users rely on.
This script can help verify that this works end to end: https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py
will check. Thanks for this effort.
I was going to try this out soon. Is this in a good spot, or is it still being worked on?
> I was going to try this out soon. Is this in a good spot, or is it still being worked on?

It's in a good state for testing, though I'll occasionally be merging in upstream changes to fix conflicts before it gets merged.
@kliuae Sorry for the late review. The PR looks good. Could you please resolve the merge conflict in CMakeLists.txt so that I can merge it? Thanks!
@kliuae Please resolve the latest merge conflict. Your PR is instrumental to our ongoing effort. Thank you very much!
@WoosukKwon Merge conflicts are resolved.