Can we reduce the kernel latency of normal dispatch and combine when overlapping with GEMM kernels by tuning UNROLL_FACTOR?

Hello, we have a question about intra-node copy efficiency that influences performance, and we would appreciate your help! We noticed degraded performance of the communication kernels when overlapping communication and computation. Through analysis, we believe the bottleneck shifts from network bandwidth to the intra-node copy in overlapping scenarios. To address this, we adjusted the UNROLL_FACTOR parameter in the intra-node copy function UNROLLED_WARP_COPY to reduce copy latency. Here are some results (from a simple benchmark simulating communication-computation overlap):

| Case (num_tokens = 8k) | Dispatch latency (ms) | Combine latency (ms) | Performance degradation |
| --- | --- | --- | --- |
| DeepEP | 3.54 | 6.91 | 0% |
| DeepEP + overlap | 4.18 | 6.97 | dispatch: 18.1%, combine: 0.87% |
| DeepEP + overlap + adapted UNROLL_FACTOR | 3.82 | 6.94 | dispatch: 7.9%, combine: 0.14% |

- Single-stream tests: performance remains largely unchanged.
- Overlap tests: the observed performance degradation is mitigated.
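
For context, a minimal sketch of the kind of two-stream overlap harness described above might look like the following. The kernel names and block counts are hypothetical stand-ins, not the actual DeepEP kernels or the exact benchmark we ran; only the stream/event structure is the point.

```cuda
// Sketch of a two-stream overlap harness: a copy kernel (standing in for
// dispatch) is timed while a compute kernel (standing in for GEMM) runs
// concurrently on another stream. All names here are illustrative.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dispatch_like_copy(const int4* __restrict__ src,
                                   int4* __restrict__ dst, size_t n) {
    // Grid-stride vectorized copy, mimicking the intra-node copy path.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        dst[i] = src[i];
}

__global__ void gemm_like_compute(float* data, size_t n, int iters) {
    // Grid-stride FMA loop that keeps most SMs busy, mimicking a GEMM.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = fmaf(v, 1.0001f, 0.0001f);
        data[i] = v;
    }
}

int main() {
    const size_t n_copy = (256 << 20) / sizeof(int4);  // 256 MiB payload
    const size_t n_compute = 1 << 22;
    int4 *src, *dst; float* work;
    cudaMalloc(&src, n_copy * sizeof(int4));
    cudaMalloc(&dst, n_copy * sizeof(int4));
    cudaMalloc(&work, n_compute * sizeof(float));

    cudaStream_t comm, comp;
    cudaStreamCreate(&comm);
    cudaStreamCreate(&comp);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    // Block counts are chosen so both kernels can be resident at once:
    // the compute kernel takes most SMs, the copy kernel gets only a few,
    // similar to how communication kernels run on a small SM budget.
    cudaEventRecord(beg, comm);
    gemm_like_compute<<<100, 256, 0, comp>>>(work, n_compute, 4096);
    dispatch_like_copy<<<24, 256, 0, comm>>>(src, dst, n_copy);
    cudaEventRecord(end, comm);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("copy latency under overlap: %.3f ms\n", ms);

    cudaFree(src); cudaFree(dst); cudaFree(work);
    return 0;
}
```

Running the copy kernel alone versus alongside the compute stream gives the single-stream and overlap numbers compared above.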

This parameter tuning is also effective when the number of SMs used by communication is small (we suspect the intra-node copy is the bottleneck in that case). For example, when the number of SMs is reduced to 12 or 18, adjusting UNROLL_FACTOR reduces the dispatch latency by 11%.
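
For readers who have not looked at this copy path, here is a minimal sketch of an unrolled warp-level copy parameterized by an unroll factor. The shape is assumed for illustration only; it is not the actual UNROLLED_WARP_COPY macro in DeepEP.

```cuda
// Illustrative unrolled warp-level copy, parameterized by kUnrollFactor.
// Each lane stages kUnrollFactor vectorized loads in registers before
// storing them, so a larger factor keeps more memory transactions in
// flight at the cost of higher register pressure.
template <int kUnrollFactor, typename T = int4>
__device__ __forceinline__ void unrolled_warp_copy(T* __restrict__ dst,
                                                   const T* __restrict__ src,
                                                   int num_elems,
                                                   int lane_id) {
    constexpr int kWarpSize = 32;
    T regs[kUnrollFactor];
    int i = lane_id;
    // Main loop: load kUnrollFactor elements per lane, then store them.
    for (; i + (kUnrollFactor - 1) * kWarpSize < num_elems;
         i += kUnrollFactor * kWarpSize) {
        #pragma unroll
        for (int j = 0; j < kUnrollFactor; ++j)
            regs[j] = src[i + j * kWarpSize];
        #pragma unroll
        for (int j = 0; j < kUnrollFactor; ++j)
            dst[i + j * kWarpSize] = regs[j];
    }
    // Tail: copy any remaining elements one per lane.
    for (; i < num_elems; i += kWarpSize)
        dst[i] = src[i];
}
```

When the copy has to share the GPU with a concurrent GEMM, keeping more loads in flight per lane can hide more of the added latency, which may explain the effect we observed when adapting the factor.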

Therefore, we are very curious about the appropriate setting for this parameter and the reasoning behind its current value. We look forward to your suggestions. Thank you!

yaning223 avatar Jun 04 '25 06:06 yaning223

Nice work! But actually I don't have any guidelines for tuning this. We are working on a TMA version to replace any kind of LD/ST copies. We will push the TMA version later; it is faster than LD/ST in general cases (the intranode part is finished, the internode part is in progress and should be done in a few weeks), and it lets you avoid tuning such things.

LyricZhao avatar Jun 05 '25 06:06 LyricZhao

Hi Lyric, @LyricZhao Will modifying to TMA only bring performance improvements when overlapping? I tested the performance of the latest commit on H100 using test_intranode with different numbers of SMs configuration, but did not observe any performance improvement.

polarstormx avatar Jun 09 '25 09:06 polarstormx

> Does switching to TMA bring performance improvements only when overlapping?

For the intranode kernels, honestly, it is simply a demo (a load/store coding-style transition) that just reduces register pressure and the trouble of tuning unrolling factors. The original version has already reached the limit, so it is normal to see no performance gain.

> I tested the performance of the latest commit on H100 using test_intranode with different SM-count configurations, but did not observe any performance improvement.

Yes, you are right. But for the internode kernels, this is a huge speedup (especially with fewer nodes). We will push it later.

LyricZhao avatar Jun 09 '25 10:06 LyricZhao
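
For anyone curious what the TMA path looks like in contrast to per-lane LD/ST copies, below is a minimal sketch based on the generic bulk-copy example in the CUDA programming guide (experimental libcu++ API, CUDA 12.x, sm_90). It is not DeepEP's implementation; it just shows why the unroll factor disappears: a single thread issues the bulk transfer and an mbarrier tracks completion.

```cuda
// Sketch of a TMA (cp.async.bulk) round trip through shared memory using the
// experimental libcu++ API (CUDA 12.x, Hopper / sm_90). One thread issues the
// bulk copies; completion is tracked by a barrier, so there is no per-lane
// unrolling factor to tune. Generic pattern, not DeepEP's kernel.
#include <cuda/barrier>

using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

constexpr int kBufLen = 1024;

__global__ void tma_copy_demo(int* data, size_t offset) {
    __shared__ alignas(16) int smem[kBufLen];
#pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);
        cde::fence_proxy_async_shared_cta();  // make the barrier visible to the async proxy
    }
    __syncthreads();

    // Global -> shared: one thread launches the bulk copy and registers the
    // expected byte count on the barrier; all threads wait for completion.
    barrier::arrival_token token;
    if (threadIdx.x == 0) {
        cde::cp_async_bulk_global_to_shared(smem, data + offset, sizeof(smem), bar);
        token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));
    } else {
        token = bar.arrive();
    }
    bar.wait(std::move(token));

    // ... compute on smem here ...

    cde::fence_proxy_async_shared_cta();  // order smem writes before the bulk store
    __syncthreads();

    // Shared -> global: again issued by a single thread, then drained.
    if (threadIdx.x == 0) {
        cde::cp_async_bulk_shared_to_global(data + offset, smem, sizeof(smem));
        cde::cp_async_bulk_commit_group();
        cde::cp_async_bulk_wait_group_read<0>();
    }
}
```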

Thanks for your reply! We can't wait to see the new version improve inter-node performance.

yaning223 avatar Jun 09 '25 11:06 yaning223

Hi @yaning223, thanks for posting this interesting issue. I'd like to ask: in the overlap case, did you increase or decrease the unroll factor to reduce the performance degradation in DeepEP?

retonym avatar Jul 18 '25 04:07 retonym

@LyricZhao We conducted some tests and observed substantial performance improvements in internode mode, which is very promising. However, when switching to LL mode, we didn't notice a significant difference. We were wondering if there might be something we misunderstood or misconfigured. Any guidance or suggestions would be greatly appreciated!

weiwei99 avatar Aug 01 '25 10:08 weiwei99