flashinfer
Can flashinfer's CutlassSegmentGEMMSM90Run function be used for LoRA computation on the H20?
I'm implementing a LoRA sgmv kernel targeted at the Hopper architecture.
Following Punica's approach, I first tried implementing the sgmv computation with the grouped GEMM from CUTLASS's example/57_hopper_grouped_gemm, but the performance was very poor (achieved bandwidth was only a few hundred GB/s).
So I would like to ask the CUDA experts here: have you benchmarked FlashInfer's CutlassSegmentGEMMSM90Run, which also builds on CUTLASS's grouped GEMM, and could it be used to implement sgmv?
The default tile size is sub-optimal for small problem shapes such as LoRA; you can try smaller tile sizes such as: https://github.com/pytorch/FBGEMM/blob/0040646440401257e005dbb8b5329225a3dbe31e/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu#L499-L517
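For reference, here is roughly where that knob lives in example 57's type definitions. This is only a sketch, not FlashInfer's actual configuration: the bf16 element types, the 128x16x128 tile shape, and the schedules are assumptions on my side, the set of legal tile shapes depends on your CUTLASS version, and the host-side setup (problem sizes, pointer arrays, strides) is unchanged from example/57_hopper_grouped_gemm and omitted here.

```cpp
// Type-level configuration sketch modeled on example/57_hopper_grouped_gemm,
// with a narrower tile than the example's default. Drop into the example's
// source in place of its Gemm type definitions.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using namespace cute;

using ElementA           = cutlass::bfloat16_t;
using ElementB           = cutlass::bfloat16_t;
using ElementC           = cutlass::bfloat16_t;
using ElementAccumulator = float;
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::RowMajor;
constexpr int AlignA = 128 / cutlass::sizeof_bits<ElementA>::value;
constexpr int AlignB = 128 / cutlass::sizeof_bits<ElementB>::value;
constexpr int AlignC = 128 / cutlass::sizeof_bits<ElementC>::value;

// A much narrower N tile than the example's default (e.g. 128x256x64); the
// 128x16x128 shape below is an assumed LoRA-friendly choice, in the spirit of
// the small configurations in the FBGEMM file linked above.
using TileShape    = Shape<_128, _16, _128>;
using ClusterShape = Shape<_1, _1, _1>;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, LayoutC *, AlignC,
    ElementC, LayoutC *, AlignC,
    cutlass::epilogue::PtrArrayNoSmemWarpSpecialized>::CollectiveOp;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA *, AlignA,
    ElementB, LayoutB *, AlignB,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::KernelPtrArrayTMAWarpSpecializedCooperative>::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cutlass::gemm::GroupProblemShape<Shape<int, int, int>>,
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

With a 64-wide M tile you would also need to switch to the ping-pong ptr-array schedule instead of the cooperative one; which variants are available again depends on the CUTLASS version.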
@yzh119 Thanks for the reply, bro. I tried a smaller tile size as you suggested, and performance did improve (by around 20%). But it still doesn't perform well: I'm using an H20 and its memory bandwidth is still not fully utilized, especially for problems like LoRA, which are L×M×N×K where M and N/K are small. Do you have any suggestions that could help us develop an efficient sgmv/bgmv kernel on the Hopper architecture?
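For context, a back-of-the-envelope check of why these shapes end up bandwidth-bound rather than compute-bound (the shape below is made up purely for illustration, not measured from my workload):

```cpp
// Rough arithmetic-intensity estimate for one LoRA "shrink" group:
// m tokens, hidden size k, LoRA rank n, bf16 operands (2 bytes/element).
#include <cstdio>

int main() {
  double m = 512, n = 16, k = 4096;              // assumed example shape
  double flops = 2.0 * m * n * k;                // multiply-accumulate count
  double bytes = 2.0 * (m * k + k * n + m * n);  // read A, read B, write C once each
  std::printf("arithmetic intensity ~ %.1f FLOP/byte\n", flops / bytes);
  // With n << k the A read dominates, so intensity is roughly n FLOP/byte,
  // which sits well below a Hopper GPU's compute-to-bandwidth ratio. The
  // kernel is therefore memory-bound, and the target is sustained HBM
  // throughput rather than tensor-core utilization.
  return 0;
}
```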