Reasons for switching to CUTLASS-based kernel instead of custom kernel
Hey folks, awesome and really impactful work with the repo and the paper.
I was wondering what the reason was for switching from the original bgmv kernel to a CUTLASS-based sgmv one. I understand that one advantage of sgmv is that it doesn't require the LoRA tensors to be in a single contiguous block of memory, but aside from that, were there any performance considerations that made you switch?
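To make sure I'm reading the memory-layout point correctly, here is roughly what I have in mind; the names and shapes below are just for illustration and don't reflect punica's actual API:

```python
import torch

# Rough illustration of the layout difference as I understand it.
# Names and shapes are mine, not punica's actual API.
num_loras, r, h, batch = 4, 16, 4096, 4
x = torch.randn(batch, h)
lora_idx = [0, 2, 2, 1]  # which LoRA each request uses

# bgmv-style: every LoRA A matrix lives in one contiguous stacked tensor,
# and the kernel gathers the right slice per request.
lora_A_stacked = torch.randn(num_loras, h, r)
y_bgmv = torch.einsum("bh,bhr->br", x, lora_A_stacked[lora_idx])

# sgmv-style: each LoRA's weights can sit in their own allocation; the kernel
# only needs a pointer per segment, so no single contiguous block is required.
lora_A_separate = [torch.randn(h, r) for _ in range(num_loras)]
y_sgmv = torch.cat([x[i : i + 1] @ lora_A_separate[lora_idx[i]] for i in range(batch)])
```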
I can also see that there is a custom sgmv shrink kernel implementation, but the expand version is WIP. Is that something you are planning to work on in the near future?
Furthermore, do the performance results in the paper concern the CUTLASS kernel or the custom kernel? From the description of the implementation I inferred the latter, but I was confused by the absence of the custom expand kernel in the repo.
Thanks, and great work!