pysot RuntimeError: NCCL error in

Open leidriver201120 opened this issue 2 years ago • 1 comment

When I tried to train with two GPUs on a Linux server, I ran into the following error (printed once by each of the two processes, so the two copies were interleaved in the output):

return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:600, unhandled cuda error

The error comes from the line dist_model = DistModule(model) in the main function of train.py. The command I ran was:

CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 --master_port=2333 tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml

I searched online for an answer. Some people say that switching to a single GPU solved the problem, but doesn't that give up the advantage of training on multiple GPUs in parallel? I have not found a solution yet. If anyone knows how to fix this, I would be very grateful for an answer.

Jun 12 '22 03:06 leidriver201120

Installing cudatoolkit=10.2 may resolve this error.
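Before changing the toolkit, it may help to see which CUDA call NCCL is actually failing on, since "unhandled cuda error" hides it. The sketch below is only an illustration, not part of pysot: `build_debug_launch` is a hypothetical helper that rebuilds the launch command from the question with NCCL's `NCCL_DEBUG=INFO` logging turned on, and derives `--nproc_per_node` from the number of visible GPU ids so the two cannot drift apart. It assumes the command is run from the pysot repository root, as in the question.

```python
import os
import sys


def build_debug_launch(visible_devices, cfg_path, master_port=2333):
    """Hypothetical helper: return (env, cmd) for a launch with NCCL logging on."""
    # Drop empty entries so a trailing comma such as "1,2," still counts two GPUs.
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not devices:
        raise ValueError("no GPU ids given")

    # Copy the current environment and add the two variables we care about.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(devices)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log the underlying failure

    # One process per visible GPU, mirroring the command in the question.
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(devices)}",
        f"--master_port={master_port}",
        "tools/train.py", "--cfg", cfg_path,
    ]
    return env, cmd


env, cmd = build_debug_launch(
    "1,2", "experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml"
)
print(" ".join(cmd[1:]))
```

To actually run it, the pair can be passed to `subprocess.run(cmd, env=env)`. The extra NCCL log lines should make it easier to tell whether the failure comes from a CUDA/driver mismatch, which is the case the cudatoolkit suggestion would address.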

Jul 18 '23 09:07 Sourabh9468
