王磊
When I tried to train with two GPUs on a Linux server, I ran into the following error (the message was printed twice, interleaved):

return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:600, unhandled cuda error

The error is raised at dist_model = DistModule(mode) in the main function of train.py. The command I ran was CUDA_VISIBLE_DEVICES=1,2 python...
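For reference, here is a minimal sketch of how CUDA_VISIBLE_DEVICES=1,2 renumbers devices inside the process. The helper name visible_device_map is hypothetical and is not part of train.py or PyTorch, and the sketch assumes the variable holds plain integer indices (no GPU UUIDs). It only illustrates the index mapping and is not a confirmed cause of the error above.

```python
def visible_device_map(cuda_visible_devices):
    """Map in-process CUDA index -> physical GPU index.

    Hypothetical helper for illustration only. Assumes a comma-separated
    list of plain integer indices, e.g. "1,2".
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    # CUDA renumbers the visible devices starting from 0, in the listed order.
    return {local: phys for local, phys in enumerate(physical)}


mapping = visible_device_map("1,2")
print(mapping)
```

With this setting the process sees exactly two devices, addressed as index 0 (physical GPU 1) and index 1 (physical GPU 2), so any device index used inside the training script would need to be 0 or 1.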