Error: "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels..."
Hello, how can I fix this error? I ran the following command, which looks correct to me:
torchrun --nproc_per_node=4 inference.py ../testFasta/PsCrtW-HpCrtZ.fasta ../databases/pdb_mmcif/mmcif_files/ \
    --output_dir ./output_4gpu \
    --uniref90_database_path ../databases/uniref90/uniref90.fasta \
    --mgnify_database_path ../databases/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path ../databases/pdb70/pdb70 \
    --param_path ../databases/params/params_model_1.npz \
    --model_name model_1 \
    --cpus 24 \
    --uniclust30_database_path ../databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path ../databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path $(which jackhmmer) \
    --hhblits_binary_path $(which hhblits) \
    --hhsearch_binary_path $(which hhsearch) \
    --kalign_binary_path $(which kalign)
[07/25/22 10:45:46] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
[07/25/22 10:45:46] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[07/25/22 10:45:46] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[07/25/22 10:45:46] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[07/25/22 10:45:50] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
[07/25/22 10:45:50] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 4
[07/25/22 10:45:50] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026, the default parallel seed is ParallelMode.DATA.
[07/25/22 10:45:50] INFO colossalai - colossalai - INFO: /anaconda/envs/fastfold_py38/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027, the default parallel seed is ParallelMode.DATA.
Generating features...
[07/25/22 10:46:12] INFO colossalai - root - INFO: Launching subprocess "/anaconda/envs/fastfold_py38/bin/jackhmmer -o /dev/null -A /tmp/tmphz8bnmj6/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 24 -N 1 ./output_4gpu/tmp.fasta ../databases/uniref90/uniref90.fasta"
INFO colossalai - root - INFO: Started Jackhmmer (uniref90.fasta) query
[07/25/22 11:11:02] INFO colossalai - root - INFO: Finished Jackhmmer (uniref90.fasta) query in 1489.951 seconds
INFO colossalai - root - INFO: Launching subprocess "/anaconda/envs/fastfold_py38/bin/hhsearch -i /tmp/tmp8xdycgi4/query.a3m -o /tmp/tmp8xdycgi4/output.hhr -maxseq 1000000 -cpu 24 -d ../databases/pdb70/pdb70"
INFO colossalai - root - INFO: Started HHsearch query
[07/25/22 11:11:43] INFO colossalai - root - INFO: Finished HHsearch query in 41.348 seconds
[07/25/22 11:11:44] INFO colossalai - root - INFO: Launching subprocess "/anaconda/envs/fastfold_py38/bin/jackhmmer -o /dev/null -A /tmp/tmpnul0i_bi/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 24 -N 1 ./output_4gpu/tmp.fasta ../databases/mgnify/mgy_clusters_2018_12.fa"
INFO colossalai - root - INFO: Started Jackhmmer (mgy_clusters_2018_12.fa) query
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804531 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804731 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804734 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21309 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 21310) of binary: /anaconda/envs/fastfold_py38/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/fastfold_py38/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/anaconda/envs/fastfold_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/anaconda/envs/fastfold_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/anaconda/envs/fastfold_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/anaconda/envs/fastfold_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/anaconda/envs/fastfold_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
inference.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2022-07-25_11:16:19
host : localhost
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 21311)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 21311
[2]:
time : 2022-07-25_11:16:19
host : localhost
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 21312)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 21312
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-07-25_11:16:19
host : localhost
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 21310)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 21310
Also, the GPU resources do not seem to be used.

Hi, you can try the current latest main branch, which fixes this timeout issue. The timeout happens because multi-GPU inference launches multiple processes but uses only one of them for the data preprocessing (alignments etc.), so the other ranks sit at a barrier until it times out.
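If you cannot update right away, a common stop-gap is to raise the collective timeout so the idle ranks survive the preprocessing stage. FastFold sets up its distributed context through colossalai, so where exactly such a hook belongs depends on the version you are running; the sketch below only illustrates the underlying torch.distributed mechanism (the default timeout is 30 minutes, which matches the Timeout(ms)=1800000 reported by the watchdog in your log), and the helper name is just illustrative:

# Sketch only: raise the NCCL collective timeout so that ranks waiting at the
# broadcast/barrier during MSA and template generation are not killed by the
# watchdog. How this plugs into FastFold/colossalai depends on your version;
# this uses plain torch.distributed and assumes launch via torchrun, which
# exports RANK, WORLD_SIZE and LOCAL_RANK.
import datetime
import os

import torch
import torch.distributed as dist


def init_process_group_with_long_timeout(hours: int = 4) -> None:
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=datetime.timedelta(hours=hours),  # default is 30 minutes
    )
    # Bind this process to its GPU, mirroring colossalai's set_device in the log above.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

In your log the uniref90 jackhmmer search alone took about 25 minutes, so the remaining alignment steps push the waiting ranks past the default 30-minute limit; a larger timeout simply keeps them alive for the whole feature-generation stage.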