Yizhi Wang comments

Results: 10 comments of Yizhi Wang

Issue 1: Missing docstring for internode_combine method

I think you should check AI-generated issues yourself before submitting to avoid basic mistakes.

[test_internode.py] failed on multi-QP: dispatch timeout on ROCE network with testing 2*H20 nodes

> https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp @sphish May I ask if this change will be incorporated into the main branch?

Can we reduce the kernel latency for normal dispatch and combine when overlapping with gemm kernels through tuning the UNROLL_FACTOR?

> Nice work! But actually I don't have some guideline to tune this. And we are working on a TMA version instead of any kind of LD/ST copies. We will...

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

I am also experiencing the same hang issue. In my case, it occurs with a setup of 4 machines (H20*8), where 2 machines function normally. The program is also stuck...

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

@shifangx I have also tested on 4*4 GB200, and it runs successfully. Here are my tests on 4*8 H20. When I set do_check=False or skip round_scale==True, the test runs successfully....

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

> what about `8*4 GB200` @shifangx Unfortunately, I only have a 4x4 GB200, so I can't test the larger configuration.

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

@shifangx Thanks! The assert didn't print out correctly after it was triggered, which made the issue quite confusing. What is the theoretical lower bound for precision when using ue8m0 for...

[BUG] test_internode hangs and times out on repeated runs

Another error:
```
[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.045 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False,...
```

[BUG] test_internode hangs and times out on repeated runs

Some other errors (with my print modifications).

When running the two-machine test_low_latency.py (EP16), there is a significant difference in the test results between two machine

@nannaer I'm also experiencing this issue. Are there any updates? Thanks!