lorax Support multiple ranks per SGMV op

Support multiple ranks per SGMV op

Open tgaddair opened this issue 1 year ago • 0 comments


Currently we support multiple ranks per batch via a loop over ranks, but this reduces the benefit of batching and makes the approach incompatible with CUDA graphs. Instead, we could either pad the buffers to the size of the largest rank in the batch, or modify the SGMV kernel to operate on mixed ranks directly.
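To illustrate the padding option, here is a minimal sketch in plain Python. It assumes each adapter is a pair of matrices `A` (hidden x rank) and `B` (rank x out); the helper names `matmul` and `pad_lora_to_rank` are hypothetical and are not part of LoRAX or the SGMV kernel. Zero-padding `A` with extra columns and `B` with extra rows leaves the product `x @ A @ B` unchanged, so every adapter in the batch can share a single rank.

```python
# Hypothetical helpers for illustration only; not LoRAX's actual API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pad_lora_to_rank(lora_a, lora_b, max_rank):
    """Zero-pad A (hidden x r) with columns and B (r x out) with rows up to max_rank."""
    rank = len(lora_b)
    extra = max_rank - rank
    if extra < 0:
        raise ValueError("max_rank must be >= the adapter's rank")
    out_dim = len(lora_b[0])
    padded_a = [list(row) + [0.0] * extra for row in lora_a]
    padded_b = [list(row) for row in lora_b] + [[0.0] * out_dim for _ in range(extra)]
    return padded_a, padded_b


# Two adapters with different ranks (1 and 2); hidden size 2, output size 2.
adapters = [
    ([[1.0], [2.0]], [[3.0, 4.0]]),                         # rank 1
    ([[1.0, 0.5], [0.0, 2.0]], [[1.0, 1.0], [2.0, -1.0]]),  # rank 2
]
max_rank = max(len(b) for _, b in adapters)
padded = [pad_lora_to_rank(a, b, max_rank) for a, b in adapters]

# The padded adapters give the same output as the originals.
x = [[1.0, 1.0]]
for (a, b), (pa, pb) in zip(adapters, padded):
    assert matmul(matmul(x, a), b) == matmul(matmul(x, pa), pb)
```

The trade-off is that lower-rank adapters spend extra compute and memory on the zero entries, which is what a mixed-rank kernel would avoid.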

Jan 04 '24 18:01 tgaddair
