flashinfer
Can flashinfer's CutlassSegmentGEMMSM90Run function be used for LoRA computation on the H20?
I'm implementing a LoRA sgmv kernel targeted at the Hopper architecture.
Following Punica's approach, I first tried implementing the sgmv computation with the grouped GEMM from CUTLASS's example/57_hopper_grouped_gemm, but the performance was very poor (achieved bandwidth was only a few hundred GB/s).
So I would like to ask the CUDA experts here: have you benchmarked FlashInfer's CutlassSegmentGEMMSM90Run, which also builds on CUTLASS's grouped GEMM, and could it be used to implement sgmv?
The default tile size is sub-optimal for small problem shapes such as LoRA; you can try smaller tile sizes such as: https://github.com/pytorch/FBGEMM/blob/0040646440401257e005dbb8b5329225a3dbe31e/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu#L499-L517
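For reference, here is roughly where that knob lives in example 57's type definitions. This is only a sketch, not FlashInfer's actual configuration: the bf16 element types, the 128x16x128 tile shape, and the schedules are assumptions on my side, the set of legal tile shapes depends on your CUTLASS version, and the host-side setup (problem sizes, pointer arrays, strides) is unchanged from example/57_hopper_grouped_gemm and omitted here.

```cpp
// Type-level configuration sketch modeled on example/57_hopper_grouped_gemm,
// with a narrower tile than the example's default. Drop into the example's
// source in place of its Gemm type definitions.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using namespace cute;

using ElementA           = cutlass::bfloat16_t;
using ElementB           = cutlass::bfloat16_t;
using ElementC           = cutlass::bfloat16_t;
using ElementAccumulator = float;
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::RowMajor;
constexpr int AlignA = 128 / cutlass::sizeof_bits<ElementA>::value;
constexpr int AlignB = 128 / cutlass::sizeof_bits<ElementB>::value;
constexpr int AlignC = 128 / cutlass::sizeof_bits<ElementC>::value;

// A much narrower N tile than the example's default (e.g. 128x256x64); the
// 128x16x128 shape below is an assumed LoRA-friendly choice, in the spirit of
// the small configurations in the FBGEMM file linked above.
using TileShape    = Shape<_128, _16, _128>;
using ClusterShape = Shape<_1, _1, _1>;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, LayoutC *, AlignC,
    ElementC, LayoutC *, AlignC,
    cutlass::epilogue::PtrArrayNoSmemWarpSpecialized>::CollectiveOp;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA *, AlignA,
    ElementB, LayoutB *, AlignB,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::KernelPtrArrayTMAWarpSpecializedCooperative>::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cutlass::gemm::GroupProblemShape<Shape<int, int, int>>,
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

With a 64-wide M tile you would also need to switch to the ping-pong ptr-array schedule instead of the cooperative one; which variants are available again depends on the CUTLASS version.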
@yzh119 Thanks for the reply, bro. I tried a smaller tile size as you suggested, and performance did improve (by around 20%). But it still doesn't perform well: I'm using an H20 and its memory bandwidth is still not fully utilized, especially for problems like LoRA, which are L×M×N×K where M and N/K are small. Do you have any suggestions that could help us develop an efficient sgmv/bgmv kernel on the Hopper architecture?
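For context, a back-of-the-envelope check of why these shapes end up bandwidth-bound rather than compute-bound (the shape below is made up purely for illustration, not measured from my workload):

```cpp
// Rough arithmetic-intensity estimate for one LoRA "shrink" group:
// m tokens, hidden size k, LoRA rank n, bf16 operands (2 bytes/element).
#include <cstdio>

int main() {
  double m = 512, n = 16, k = 4096;              // assumed example shape
  double flops = 2.0 * m * n * k;                // multiply-accumulate count
  double bytes = 2.0 * (m * k + k * n + m * n);  // read A, read B, write C once each
  std::printf("arithmetic intensity ~ %.1f FLOP/byte\n", flops / bytes);
  // With n << k the A read dominates, so intensity is roughly n FLOP/byte,
  // which sits well below a Hopper GPU's compute-to-bandwidth ratio. The
  // kernel is therefore memory-bound, and the target is sustained HBM
  // throughput rather than tensor-core utilization.
  return 0;
}
```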