DeepEP icon indicating copy to clipboard operation
DeepEP copied to clipboard

Suggestions for Improving the Readability of DeepEP Code

Open zuochunwei opened this issue 9 months ago • 12 comments

I recently spent some time studying the source code of DeepEP, particularly the cross-node communication execution pipeline—excellent work with high engineering quality, I learned so much from it. However, while reading the code, I encountered some obstacles that took me quite a while to fully understand the author's intent. I’d like to share the issues I faced during my learning process and provide a few improvement suggestions:

  1. RDMA Buffer Layout and Operations

I believe that instead of constructing the SymBuffer rdma_recv_num_tokens_mixed object via rdma_buffer_ptr, a simpler and clearer approach would be to define corresponding structs.

  1. Lack of Sufficient Annotations

The implementations of get_rdma_clean_meta and get_nvl_clean_meta lack sufficient comments, especially regarding the meaning of the second field in the returned pair.

  1. Overly Complex Interface Parameters

The internode_dispatch and internode_combine interfaces have too many parameters, which are then passed almost unchanged in subsequent calls. This not only increases comprehension difficulty but also raises the risk of incorrect parameter passing. A possible solution is to encapsulate them in a struct.

If you agree with these suggestions, I’d be happy to submit a PR to address these points. Thank you! @LyricZhao @sphish @haswelliris @dzhulgakov @youkaichao @fzyzcjy

zuochunwei avatar Mar 28 '25 08:03 zuochunwei

Thank you for your thoughtful feedback and for taking the time to study DeepEP's codebase so thoroughly! I really appreciate your kind words about the engineering quality, and your suggestions are all quite reasonable - especially the second point about insufficient annotations, which I agree could be particularly challenging for anyone who wasn't involved in the original design.

That said, we initially open-sourced this project primarily to make it available for others to use, and maintaining additional documentation or interface changes would create significant overhead for our team. We actually maintain an internal fork with some differences (like your third suggestion about parameter encapsulation), and keeping these versions aligned is already quite of dirty work.

Beyond the points you mentioned, there are many other complex design decisions (some even requiring mathematical proofs) that aren't currently documented. Comprehensively documenting everything would be quite an undertaking.

Instead, might I suggest an alternative? You clearly have insight into the codebase - would you consider writing a blog post or creating an interpretation guide in the Issues section? I'd be happy to link to it from the README to help other developers learn from your understanding. This way we can share knowledge without creating maintenance burdens.

Thanks again for your valuable perspective - we really appreciate it!

LyricZhao avatar Mar 28 '25 08:03 LyricZhao

BTW, we are also planning a full refactor (better performance, less SMs, better readability) maybe several months later :)

LyricZhao avatar Mar 28 '25 08:03 LyricZhao

Thanks for your reply.

I spent roughly ten days studying the relevant source code. Some sections lacked clear intent - I had to carefully analyze them alongside diagrams and cross-reference other parts of the codebase to understand their purpose. I believe other developers researching DeepEP's source code would encounter similar challenges. To address this, I've compiled nearly 20,000 words of documentation, added extensive code annotations, and created several explanatory diagrams. However, all these materials are currently in Chinese, and I likely won't have capacity to produce English versions. I'm attaching some representative examples to this issue.

I look forward to future improved versions of the codebase.

zuochunwei avatar Mar 28 '25 09:03 zuochunwei

Image

zuochunwei avatar Mar 28 '25 09:03 zuochunwei

Image

zuochunwei avatar Mar 28 '25 09:03 zuochunwei

Image

zuochunwei avatar Mar 28 '25 09:03 zuochunwei

You can shared Chinese version in issues (new issue is also OK) as well (or a forked repo link or blog link) 👍🏻

LyricZhao avatar Mar 28 '25 09:03 LyricZhao

You can shared Chinese version in issues (new issue is also OK) as well (or a forked repo link or blog link) 👍🏻

OK, I will consider your suggestion.

zuochunwei avatar Mar 28 '25 09:03 zuochunwei

You can shared Chinese version in issues (new issue is also OK) as well (or a forked repo link or blog link) 👍🏻

OK, I will consider your suggestion.

最近我也在看 DeepEP 源码,感觉你分析挺深入的,如果你能给大家分享你的分析文章,或许大家会碰撞出一些新的火花

alpha-baby avatar Apr 11 '25 03:04 alpha-baby

Hi, is there any updates? Looking forward to it!

fzyzcjy avatar May 30 '25 00:05 fzyzcjy

You can shared Chinese version in issues (new issue is also OK) as well (or a forked repo link or blog link) 👍🏻

OK, I will consider your suggestion.

looking forward to it. Thanks for your great work. :)

yanminjia avatar Jun 05 '25 02:06 yanminjia

You can shared Chinese version in issues (new issue is also OK) as well (or a forked repo link or blog link) 👍🏻你可以在 issues 中分享中文版本(新建 issue 也可以),或者提供 fork 的仓库链接或博客链接 👍🏻

OK, I will consider your suggestion.好的,我会考虑你的建议。

we looking forward your forked repo link or blog link!!!

harleyszhang avatar Jul 31 '25 07:07 harleyszhang