yanminjia issues

Results: 4 issues by yanminjia

[QUESTION] Why does it take so much time to sync barrier information between ranks?

We identified an issue while testing Megatron-LM with a 6B model on 1K GPUs. By checking the output of each iteration, we found the difference of min...

stale

Crash when NVSHMEM_IBGDA_ENABLE_MULTI_PORT is set to 1 with a dual-port RNIC

When I ran test_internode.py on a dual-port RNIC with the environment variable _**NVSHMEM_IBGDA_ENABLE_MULTI_PORT**_ set to 1, DeepEP crashed while creating the RDMA team by calling nvshmem_team_split_strided(...) in the following code...

Benchmark test over RoCE network

We ran test_internode.py over a RoCE network on 4 H800 servers with 8 GPUs each. However, the test result is pretty poor compared with the case of...

Crash when testing DeepEP over 16 H100 servers

When we tested DeepEP (test_internode.py) over 16 H100 servers, the dispatch phase finished successfully. Unfortunately, DeepEP crashed in the combine phase. It looks like DeepEP crashed at an assert statement. It would...