Allow using few SMs for low-latency mode
The code diff is not intended for merging as-is, but to demonstrate how the experiments below were done. If anyone is interested, or this direction looks acceptable to merge, I am happy to polish the code and work on it further!
The code and experiment data are extracted from old experiments for my previous PR: https://github.com/deepseek-ai/DeepEP/pull/249.
Figure 1: num-sm vs. performance

As can be seen, when using 9 warpgroups (i.e., few SMs), the performance only slows down slightly. This makes a simple overlap between these kernels and computation feasible.
For dispatch we may need extra work, though, since the warp specialization may be suboptimal when there are few SMs.
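To make the intended overlap concrete, here is a rough sketch of what I have in mind; `dispatch_fn` and `moe_compute_fn` are placeholders rather than the actual DeepEP API:

```python
# Hypothetical sketch: run the few-SM communication kernel on a side stream
# while computation occupies the remaining SMs on the default stream.
# `dispatch_fn` / `moe_compute_fn` are placeholders, not the DeepEP API.
import torch

comm_stream = torch.cuda.Stream()

def overlapped_step(hidden_states, topk_idx, dispatch_fn, moe_compute_fn):
    # Fork: the comm stream waits for inputs produced on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        # Occupies only a few SMs, leaving the rest free for the compute below.
        recv_states, handle = dispatch_fn(hidden_states, topk_idx)
    # Meanwhile, other computation keeps the remaining SMs busy.
    other_out = moe_compute_fn()
    # Join: the default stream waits for the dispatch results before using them.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return recv_states, handle, other_out
```

Since the communication kernel only occupies a few SMs, the compute kernel launched on the default stream can use the remaining SMs at the same time.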
Hi, that is a good idea, but I want to understand the mentioned “overlap” in more detail. Does it refer to the overlap between the dispatch/combine kernels and the model computation kernels—i.e. two separate streams like in prefill? But during decode a CUDA Graph is enabled; wouldn’t the CUDA Graph turn them into a single sequential execution stream and eliminate that overlap?
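One possibility I can think of (I may be wrong): if the side-stream work is forked and joined with `wait_stream` during capture, the cross-stream dependencies are recorded into the graph, so independent comm and compute nodes might still run concurrently on replay. A minimal sketch of what I mean, with placeholder kernels rather than real DeepEP/SGLang code:

```python
# Rough sketch: under stream capture, work forked onto a side stream with the
# usual wait_stream fork/join is recorded into the graph together with its
# cross-stream dependencies, so comm and compute nodes that do not depend on
# each other can still overlap when the graph replays.
import torch

comm_stream = torch.cuda.Stream()
x = torch.randn(16, 4096, device="cuda")

def comm_kernel(t):      # stand-in for dispatch/combine
    return t * 2

def compute_kernel(t):   # stand-in for the dense compute
    return t @ t.t()

# Warm up on a side stream, as recommended before capturing a CUDA graph.
warmup = torch.cuda.Stream()
warmup.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(warmup):
    comm_kernel(x); compute_kernel(x)
torch.cuda.current_stream().wait_stream(warmup)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    comm_stream.wait_stream(torch.cuda.current_stream())  # fork from the capture stream
    with torch.cuda.stream(comm_stream):
        y_comm = comm_kernel(x)                           # recorded on the comm stream
    y_comp = compute_kernel(x)                            # recorded on the capture stream
    torch.cuda.current_stream().wait_stream(comm_stream)  # join before capture ends

graph.replay()  # the two branches stay independent nodes inside the graph
```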
The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of SGLang, the communication kernel and the compute kernel run serially.
I think it may involve two-batch overlap, so ideally the communication kernel and the compute kernel can run at the same time. But still, as I mentioned, with CUDA Graph it turns into serial kernel execution. I am very curious how to avoid that.
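Something like the following rough sketch is what I imagine by two-batch overlap; `comm_fn` and `compute_fn` are placeholders, not the actual SGLang/DeepEP APIs:

```python
# Hypothetical two-batch-overlap sketch: while micro-batch A is in communication,
# micro-batch B computes, then the roles swap. Placeholder functions only.
import torch

comm_stream = torch.cuda.Stream()

def two_batch_step(batch_a, batch_b, comm_fn, compute_fn):
    # Stage 1: A communicates on the side stream while B computes on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        a_recv = comm_fn(batch_a)
    b_hidden = compute_fn(batch_b)
    torch.cuda.current_stream().wait_stream(comm_stream)

    # Stage 2: roles swap; B communicates while A's received tokens are processed.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        b_recv = comm_fn(b_hidden)
    a_out = compute_fn(a_recv)
    torch.cuda.current_stream().wait_stream(comm_stream)
    return a_out, b_recv
```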