Output tokens in intranode::dispatch aren't packed for each expert?
The output token offset in the intranode::dispatch kernel is channel_offset + rank_offsets + recv_token_idx, so the tokens for each expert aren't contiguous?
If so, do you have plans to optimize this?
Thanks.
The tokens for each expert aren't contiguous?
Yes, and it is by design. In some MoE models, a token will on average select more than one expert within a single GPU rank. If we automatically expanded/copied these tokens for each selected expert, GPU memory consumption would be multiplied by the average number of experts selected per token per rank, which can make training OOM.
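As a rough worked example of that blow-up (all numbers below are illustrative assumptions, not values from this library):

```python
# Illustrative assumptions: token count, hidden size, dtype, and experts-per-token
# are made-up numbers, not values taken from the library.
num_recv_tokens, hidden, bytes_per_elem = 4096, 7168, 2   # BF16 tokens received on one rank
experts_per_token_on_rank = 4                             # avg. experts a token selects on this rank

unexpanded = num_recv_tokens * hidden * bytes_per_elem    # each token stored once
expanded = unexpanded * experts_per_token_on_rank         # each token copied per selected expert
print(unexpanded / 2**20, expanded / 2**20)               # 56.0 MiB -> 224.0 MiB
```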
So if you want to do a grouped GEMM later, you have to expand the tokens yourself and control the GPU memory precisely. We currently don't have plans to refactor this design.
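If it helps, here is a minimal PyTorch sketch of that manual expansion. The tensor names (`recv_x`, `recv_topk_idx`) and their layout are assumptions for illustration, not the actual dispatch output of the library:

```python
import torch

def expand_for_grouped_gemm(recv_x: torch.Tensor, recv_topk_idx: torch.Tensor,
                            num_local_experts: int):
    """Copy each received token once per selected local expert so that tokens
    for each expert become contiguous, ready for a grouped GEMM.

    recv_x:        [num_recv_tokens, hidden]  received (unexpanded) tokens
    recv_topk_idx: [num_recv_tokens, topk]    local expert id, or -1 if unused
    """
    # Find every (token, slot) pair that maps to a valid local expert.
    token_idx, slot_idx = torch.nonzero(recv_topk_idx >= 0, as_tuple=True)
    expert_idx = recv_topk_idx[token_idx, slot_idx]

    # Sort by expert id so each expert's tokens form one contiguous range.
    order = torch.argsort(expert_idx, stable=True)
    token_idx, expert_idx = token_idx[order], expert_idx[order]

    # Expanded buffer: memory grows by the average number of experts selected
    # per token on this rank -- exactly the blow-up described above.
    expanded_x = recv_x[token_idx]

    # Per-expert token counts, usable as group sizes for the grouped GEMM.
    group_sizes = torch.bincount(expert_idx, minlength=num_local_experts)
    return expanded_x, group_sizes, token_idx
```

The returned `token_idx` can later be used to scatter/accumulate the grouped GEMM outputs back to the unexpanded token order.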