lorax
Support multiple ranks per SGMV op
Currently we support multiple ranks per batch via a loop, but this reduces the batching effect and makes the process infeasible for CUDA graphs. Instead, we can either pad the buffers to the size of the largest rank in the batch, or modify the SGMV kernel to operate on mixed ranks.
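
To illustrate the padding option, here is a minimal sketch in plain Python. It assumes each adapter is a pair of LoRA matrices `A` (hidden x rank) and `B` (rank x out). The helper names `matmul` and `pad_lora` are hypothetical and are not part of LoRAX or the SGMV kernel. The point is only that zero-padding the rank dimension leaves the result unchanged, so every adapter in the batch can share one rank.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pad_lora(a, b, max_rank):
    """Zero-pad A (hidden x r) along columns and B (r x out) along rows up to max_rank.

    Hypothetical helper for illustration only, not a LoRAX function.
    """
    rank = len(b)
    out_dim = len(b[0])
    extra = max_rank - rank
    a_pad = [row + [0.0] * extra for row in a]
    b_pad = b + [[0.0] * out_dim for _ in range(extra)]
    return a_pad, b_pad


# Two adapters with different ranks (1 and 2), same hidden and output size.
adapters = [
    ([[1.0], [2.0]], [[3.0, 4.0]]),                          # rank 1
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),    # rank 2
]
x = [[1.0, 1.0]]  # one token, hidden size 2

# Pad every adapter to the largest rank in the batch.
max_rank = max(len(b) for _, b in adapters)
padded = [pad_lora(a, b, max_rank) for a, b in adapters]

# The padded rows and columns are all zeros, so they contribute nothing.
for (a, b), (a_pad, b_pad) in zip(adapters, padded):
    assert matmul(matmul(x, a_pad), b_pad) == matmul(matmul(x, a), b)
```

The trade-off is that lower-rank adapters spend extra compute and memory on the zero entries, which is what a kernel operating on mixed ranks would avoid.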