When I run
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/GO-BP/gearnet_edge.yaml --gpus [0,1,2,3] --ckpt
on a single worker with 4 Tesla-V100-SXM2-32GB GPUs and 47 CPUs, I get the following error:
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
Traceback (most recent call last):
  File "/hubozhen/GearNet/script/downstream.py", line 75, in <module>
    train_and_validate(cfg, solver, scheduler)
  File "/hubozhen/GearNet/script/downstream.py", line 30, in train_and_validate
    solver.train(**kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/core/engine.py", line 155, in train
    loss, metric = model(batch)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 279, in forward
    pred = self.predict(batch, all_loss, metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 300, in predict
    output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/models/gearnet.py", line 99, in forward
    edge_hidden = self.edge_layers[i](line_graph, edge_input)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 92, in forward
    output = self.combine(input, update)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 438, in combine
    output = self.batch_norm(output)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
    world_size,
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 42, in forward
    dist._all_gather_base(combined_flat, combined, process_group, async_op=False)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2070, in _all_gather_base
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/data/graph.py:1667: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  edge_in_index = local_index // local_inner_size + edge_in_offset
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary: /opt/anaconda3/envs/manifold/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/hubozhen/GearNet/script/downstream.py FAILED
---------------------------------------------------
Failures:
[1]:
  time : 2022-12-12_09:41:02
  host : pytorch-7c3c96f1-d9hcm
  rank : 2 (local_rank: 2)
  exitcode : -6 (pid: 22)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 22
[2]:
  time : 2022-12-12_09:41:02
  host : pytorch-7c3c96f1-d9hcm
  rank : 3 (local_rank: 3)
  exitcode : -6 (pid: 23)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 23
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-12-12_09:41:02
  host : pytorch-7c3c96f1-d9hcm
  rank : 0 (local_rank: 0)
  exitcode : -6 (pid: 20)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 20
===================================================
Someone said this happens when loading large data, and I see that all four GPUs are at 100% utilization. However, when I ran the same procedure on another V100 machine (a single worker with 4 Tesla-V100-SXM-32GB GPUs and 48 CPUs), it worked fine. This confuses me.
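In case it helps to narrow this down, below is a rough sketch of what I plan to try next: raising the NCCL collective timeout and turning on NCCL debug logging so I can see which collective gets stuck. This is only my guess at the relevant knobs, not a drop-in change to downstream.py; torchdrug's core.Engine normally initializes the process group itself, and the environment variables and timeout argument here are assumptions on my side.

```python
# Sketch only: raise the NCCL watchdog timeout (the log shows the default
# Timeout(ms)=1800000) and enable NCCL debug output.
# Assumption: LOCAL_RANK / RANK / WORLD_SIZE are set by the launcher
# (torch.distributed.launch or torchrun).
import datetime
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")               # log NCCL init and collective info
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")   # fail fast instead of hanging

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),  # larger than the default 30-minute watchdog timeout
)
```

I am not sure whether the timeout itself is the real problem or just a symptom of rank 0 stalling somewhere else, so any pointers are appreciated.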