
#Distributed training bug

Open CBQ-1223 opened this issue 8 months ago • 0 comments

  File "train.py", line 191, in <module>
    main()
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 120, in main
    torch.distributed.broadcast(seed, src=0)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

torch.distributed.broadcast(seed, src=0)

This call raises the error above on A100 GPUs. Have you run into this before, or do you know how to fix it?
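The `unhandled system error` message on its own does not say which system call failed, so rerunning with NCCL's debug logging turned on may help narrow it down. Below is a minimal sketch of one way to do that, not a confirmed fix. `NCCL_DEBUG` is a standard NCCL environment variable; the helper name `nccl_debug_env` is made up for illustration, and the launch command in the trailing comment is an assumption, since the actual command is not shown in this issue.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of an environment with NCCL debug logging enabled.

    `base` is the environment to start from (defaults to os.environ).
    A value for NCCL_DEBUG that is already set is left untouched.
    """
    env = dict(os.environ if base is None else base)
    # INFO makes NCCL print initialization and transport details,
    # which usually includes the underlying cause of a system error.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


# Hypothetical usage: relaunch the training script with the extra logging.
# The real launch command and arguments are not given in this issue.
#
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=nccl_debug_env())
```

The extra log lines printed before the `torch.distributed.broadcast(seed, src=0)` failure would be useful to attach here.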

CBQ-1223 · Jun 12 '24 13:06