Support multiple ranks per SGMV op

Open tgaddair opened this issue 1 year ago • 0 comments
Currently we support multiple ranks per batch via a loop, but this reduces the batching effect and makes the process infeasible for CUDA graphs. Instead, we can either pad the buffers to the size of the largest rank in the batch, or modify the SGMV kernel to operate on mixed ranks.
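
To illustrate the padding option, here is a minimal sketch in plain Python, not the actual LoRAX or SGMV code. It assumes each adapter is a pair of LoRA factors `A` (hidden x rank) and `B` (rank x out), and that padding means appending zero columns to `A` and zero rows to `B`. The helper names `matmul` and `pad_rank` are hypothetical, and lists of lists stand in for GPU buffers.

```python
# Hypothetical sketch: zero-pad LoRA factors to a common rank.
# Plain Python lists stand in for GPU tensors; this is not the SGMV kernel.

def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def pad_rank(lora_a, lora_b, max_rank):
    """Zero-pad A (hidden x r) and B (r x out) up to rank max_rank."""
    rank = len(lora_b)
    extra = max_rank - rank
    if extra < 0:
        raise ValueError("max_rank is smaller than the adapter rank")
    out_dim = len(lora_b[0])
    # Extra columns of A and extra rows of B are all zeros, so they
    # contribute nothing to the product A @ B.
    padded_a = [list(row) + [0.0] * extra for row in lora_a]
    padded_b = [list(row) for row in lora_b]
    padded_b += [[0.0] * out_dim for _ in range(extra)]
    return padded_a, padded_b


# One input row with hidden size 2.
x = [[1.0, 2.0]]

# A rank-1 adapter that shares a batch with a rank-2 adapter.
a1, b1 = [[1.0], [2.0]], [[3.0, 4.0]]
a1_padded, b1_padded = pad_rank(a1, b1, max_rank=2)

unpadded = matmul(matmul(x, a1), b1)
padded = matmul(matmul(x, a1_padded), b1_padded)
```

Under these assumptions `padded` equals `unpadded`, so every adapter in the batch can share one buffer shape. The trade-off is that lower-rank adapters spend extra work multiplying by zeros, which is what a mixed-rank kernel would avoid.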

tgaddair · Jan 04 '24 18:01