Jee Jee Li comments

Results: 206 comments by Jee Jee Li

[Kernel][RFC] Refactor the punica kernel based on Triton

Currently, while the sgmv kernel I've implemented achieves high performance in long-sequence scenarios, it falls short of Punica's bgmv for small batches and short sequences. I'm working...

[Kernel][RFC] Refactor the punica kernel based on Triton

Uploading the test results with Yi-34B: ![9bfde45f032a2c8670a723deaed806c3](https://github.com/vllm-project/vllm/assets/19733142/3299065b-ff4b-40d9-89ce-b38f005b0c56)

[Kernel][RFC] Refactor the punica kernel based on Triton

Outstanding issues:
- Full-shard LoRA support
- Resolve the other LoRA tests

[Kernel][RFC] Refactor the punica kernel based on Triton

The markers of the first integration method: https://github.com/jeejeelee/vllm/tree/00e007695c8cfa466f53fa74a0a601aa42a10cd7

[Kernel][RFC] Refactor the punica kernel based on Triton

> Will you merge this soon?

Thank you for your attention. I'm not sure if we can merge yet, but I have completed most of the development work. You can...

[Kernel][RFC] Refactor the punica kernel based on Triton

@simon-mo Could you please check why the CI test failed? I have actually completed the unit tests locally and would like to see if there are any omissions.

[Kernel][RFC] Refactor the punica kernel based on Triton

@Yard1 Thanks for your review. I will fix these ASAP.

[Kernel][RFC] Refactor the punica kernel based on Triton

> max_num_batched_tokens must be

[Kernel][RFC] Refactor the punica kernel based on Triton

/ready

[Kernel][RFC] Refactor the punica kernel based on Triton

# Libentry test

The current version of Triton used in vLLM is 2.3.1, while the official Triton version is 3.0.0. Therefore, we tested the usage of libentry in these two...