yanminjia

Results: 4 issues by yanminjia

We identified an issue while testing Megatron-LM with a 6B model on 1K GPUs. By checking the output of each iteration, we found the difference of min...

stale

When I ran test_internode.py in a dual-port setup with the environment variable _**NVSHMEM_IBGDA_ENABLE_MULTI_PORT**_ set to 1, DeepEP crashed while creating the RDMA team by calling nvshmem_team_split_strided(...) in the following code...

We ran test_internode.py over a RoCE network on 4 H800 servers with 8 GPUs per server. However, the test results are quite poor compared with the case of...

When we tested DeepEP (test_internode.py) over 16 H100 servers, the dispatch phase finished successfully. Unfortunately, DeepEP crashed in the combine phase. It looks like DeepEP crashed at an assert statement. It would...