[Question] Question on using num_worst_tokens

Open yuhyao opened this issue 1 month ago • 0 comments

Hi DeepSeek team,

I noticed that dispatch takes a num_worst_tokens argument, which makes sense for CUDA Graph support. However, when it is used, the values written to moe_recv_counter and moe_recv_expert_counter no longer seem to be directly accessible. From the code, it looks like these counters can be recomputed from recv_topk_idx, since padded entries are set to -1.
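To make the recomputation concrete, here is a minimal sketch of what I have in mind. It is not taken from the DeepEP code: the function name `recompute_recv_counters` is hypothetical, and it assumes recv_topk_idx is a 2-D int64 tensor whose valid entries are local expert ids in [0, num_local_experts), with every other entry (including whole padded rows) set to -1. It uses a fixed-size scatter-add so that no output shape depends on the data.

```python
import torch


def recompute_recv_counters(recv_topk_idx: torch.Tensor, num_local_experts: int):
    """Recompute receive counters from recv_topk_idx (hypothetical helper).

    Assumes valid entries are local expert ids and all other entries are -1.
    Returns (num_recv_tokens, num_recv_tokens_per_expert), both as tensors on
    the same device as the input, without copying anything to the host.
    """
    # Shift by one so that -1 lands in a throwaway bin at position 0.
    shifted = (recv_topk_idx + 1).flatten().to(torch.int64)

    # One extra bin for the -1 entries; the output shape is static.
    bins = torch.zeros(num_local_experts + 1, dtype=torch.int64,
                       device=recv_topk_idx.device)
    bins.scatter_add_(0, shifted, torch.ones_like(shifted))

    # Drop the throwaway bin to get the per-expert counts.
    per_expert = bins[1:]

    # A row counts as a received token if it has at least one valid entry.
    num_tokens = (recv_topk_idx >= 0).any(dim=1).sum()
    return num_tokens, per_expert


# Example: 4 rows (the last two are padding), top-2 routing, 3 local experts.
example_idx = torch.tensor([[0, -1], [1, 0], [-1, -1], [-1, -1]])
example_tokens, example_per_expert = recompute_recv_counters(example_idx, 3)
```

This only works if every real received row has at least one non-negative entry, which is my assumption about how the padding is laid out.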

The prefill trace also doesn’t show any H2D memcpy after the dispatch kernel, so I assume you recompute those counters inside another kernel. Is that correct?

Would it be possible to write these counters into CUDA tensors directly? For example, open-source projects like SGLang simply copy num_recv_tokens_per_expert_list to device memory. Having these values available on CUDA would make it easier to switch to CPU-async mode and may also help avoid redundant memory accesses.

Thanks!

Nov 28 '25 10:11 yuhyao