File "train.py", line 191, in <module>
main()
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "train.py", line 120, in main
torch.distributed.broadcast(seed, src=0)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
The call torch.distributed.broadcast(seed, src=0) fails with the error above on an A100 GPU. Have you run into this before, or do you know how to fix it?