Results: 9 comments by liuhatry

Hi @denera I tested `examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py`: 1 node works, 2 nodes fail.

# My environment:
- H800
- NVIDIA-SMI 535.161.08, Driver Version: 535.161.08, CUDA Version: 12.2
- torch 2.1.1
- te: 1.9.0.dev0+70111a3

**I modified...
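(For reference, the versions above can be printed with a short helper like this; it is my own snippet, and I am assuming `transformer_engine` exposes `__version__`:)

```python
import torch
import transformer_engine as te

print("torch:", torch.__version__)   # 2.1.1 here
print("CUDA :", torch.version.cuda)  # 12.2
print("te   :", te.__version__)      # 1.9.0.dev0+70111a3 (assumes te defines __version__)
```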

Hi @denera I tested the new code; it still fails:

```
NCCL_SOCKET_IFNAME=bond1 python3 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000 --num-replicas 1
```

```
!!! [NVTE] Bootstrapping Userbuffers with backend="gloo"
!!! [NVTE] Number of physical nodes: 1
!!!...
```

Hi @denera your example code still cannot run, same as before. I checked the torch code, and it indicates:

> The Gloo backend does not support this API.

https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L3392

```
File "gloo.py", line...
```
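For what it's worth, here is a minimal sketch of the kind of backend guard that would sidestep this, assuming the unsupported call is a fused tensor collective such as `all_gather_into_tensor` (my assumption from the truncated traceback; the fallback below is illustrative, not the fix used in the repo):

```python
import torch
import torch.distributed as dist

def gather_1d(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Gather a 1-D tensor from all ranks, avoiding APIs Gloo lacks."""
    world_size = dist.get_world_size(group)
    if dist.get_backend(group) == "gloo":
        # Gloo has no all_gather_into_tensor; use the list-based
        # all_gather and concatenate the pieces instead.
        chunks = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.all_gather(chunks, tensor, group=group)
        return torch.cat(chunks)
    # NCCL supports the fused tensor variant directly.
    out = torch.empty(world_size * tensor.numel(),
                      dtype=tensor.dtype, device=tensor.device)
    dist.all_gather_into_tensor(out, tensor, group=group)
    return out
```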

Hi @denera I can run your snippet, but cannot run ln_mlp_with_overlap.py with two nodes.

# snippet
## one node with nccl can run
```
export BOOTSTRAP_BACKEND=nccl
torchrun --nproc_per_node 8 --nnodes 1...
```
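To separate the two failure modes, a bare torch.distributed program launched the same way can confirm whether multi-node NCCL rendezvous works independently of the Userbuffers setup. This is a sketch of my own; `check_bootstrap.py` is a hypothetical helper, not part of the repo:

```python
# check_bootstrap.py -- minimal multi-node NCCL sanity check.
# Launch with the same torchrun arguments as the example, e.g.
#   torchrun --nproc_per_node 8 --nnodes 2 ... check_bootstrap.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # A device barrier exercises the same rendezvous path the example relies on.
    dist.barrier(device_ids=[local_rank])
    if dist.get_rank() == 0:
        print(f"bootstrap OK across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```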

Hi @denera running with UB_SKIPMC=1 also fails:

```
!!! [NVTE] Bootstrapping Userbuffers with backend="gloo"
!!! [NVTE] Number of physical nodes: 1
!!! [NVTE] Global ranks on node 0: [0, 1,...
```

Hi @denera Thanks for your reply. Because my torch version is 2.1, the example fails when num_replicas=2 is set:

```
File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 175, in train
AttributeError: module 'torch.distributed' has no...
```
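The truncated AttributeError does not show which attribute is missing, so as a generic pattern (my sketch, with a placeholder name) the example could guard version-dependent torch.distributed APIs like this:

```python
import torch.distributed as dist

# "some_newer_api" is a placeholder for whichever attribute torch 2.1 lacks;
# it is NOT a real torch.distributed name.
if hasattr(dist, "some_newer_api"):
    collective = dist.some_newer_api
else:
    # Fail with a clear message (or fall back to an older equivalent)
    # instead of the bare AttributeError the example raises on torch 2.1.
    raise RuntimeError("this example needs a newer torch.distributed API; "
                       "please upgrade torch")
```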

Hi @denera, I hit a new error when running on two nodes: the intra-node barrier hangs:

```
TENCENT64:860293:861393 [0] bootstrap.cc:150 NCCL WARN Bootstrap Root : rank 5 of...
```
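For diagnosing a bootstrap hang like this, a first step I would suggest (my own suggestion, not from the thread) is enabling NCCL's own logging and pinning the bootstrap interface before process-group init. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables, and `bond1` is just the interface used earlier in this thread:

```python
import os

# Must be set before torch.distributed / NCCL initialization.
os.environ.setdefault("NCCL_DEBUG", "INFO")             # print bootstrap/init logs
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # focus on rendezvous + network
os.environ.setdefault("NCCL_SOCKET_IFNAME", "bond1")    # interface from this thread
```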

Hi @denera, can you please help confirm this issue? Thanks.

Hi @denera, PR #1087 fixes my problem, thanks.