Shifang Xu
Hi, @Infi-zc, thanks for your feedback. Can you provide the hyperparameters used to reproduce the issue? I will first try to replicate the problem and then proceed with debugging and...
I will take a look at this.
I have not been able to reproduce the issue yet. This test case requires the use of IBGDA and needs the cluster administrator to help with the configuration. I have...
I tested on GB200 with 4 nodes of 4 GPUs each, and the test passed, so I cannot reproduce this issue on GB200. @jeffye-dev @polarstormx, can you help to...
@polarstormx Thank you very much for your feedback.
> I use 6*H100 with IB network cards to run low-latency tests, and it's 100% reproducible. This stops me from using the latest version of DeepEP. I have to use the older...
> I have also tested on 4*4 GB200, and it runs successfully. @polarstormx, what about `8*4 GB200`? I have tested on GB200. It runs successfully on `2*4 GB200`, `4*4...
**Root cause: one rank fails an assert and exits. All the other ranks then wait for that rank at some point, which causes the hang.** I ran experiments...
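To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not DeepEP or Megatron-LM code: the ranks are simulated with threads, `run_ranks` and its parameters are hypothetical names, and a `threading.Barrier` stands in for whatever synchronization point the real ranks wait at. One simulated rank raises an `AssertionError` and never reaches the barrier; the barrier timeout is only there so the sketch terminates instead of hanging.

```python
import threading


def run_ranks(world_size, failing_rank, timeout_s=0.5):
    """Simulate `world_size` ranks that meet at one barrier.

    `failing_rank` hits a simulated assert before the barrier and exits.
    Without the timeout, the surviving ranks would wait forever.
    """
    barrier = threading.Barrier(world_size)
    status = {}

    def worker(rank):
        try:
            if rank == failing_rank:
                # Stand-in for the assert that kills one rank.
                raise AssertionError(f"rank {rank}: simulated assert failure")
            # Surviving ranks block here, waiting for every rank to arrive.
            barrier.wait(timeout=timeout_s)
            status[rank] = "ok"
        except AssertionError:
            status[rank] = "assert_failed"
        except threading.BrokenBarrierError:
            # Raised once the wait times out and the barrier is broken.
            status[rank] = "timed_out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    for rank, state in sorted(run_ranks(world_size=4, failing_rank=1).items()):
        print(rank, state)
```

In this sketch the surviving ranks report `timed_out` rather than the actual error, which is why a hang like this tends to hide the assert that caused it. Checking the log of the rank that exited first is usually the quickest way to find it.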
Hi wan-nan, thanks for looking into it. This is being addressed in an internal MR.
This issue is fixed by the following commit: https://github.com/NVIDIA/Megatron-LM/commit/87d9d2506acefaf3bd617b27ebbd24c7ddfcea5c