SA-SSD

Multi-GPU training error

Open · vehxianfish opened this issue on Apr 18, 2021 · 0 comments

I am training with multiple GPUs, but the following error occurs:

Traceback (most recent call last):
  File "./train.py", line 131, in <module>
    main()
  File "./train.py", line 82, in main
    model = MMDistributedDataParallel(model.cuda(),find_unused_parameters=True)
  File "/home/ubuntu-502/xu/CIA-SSD/envs/CIA-SSD/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
    self._ddp_init_helper()
  File "/home/ubuntu-502/xu/CIA-SSD/envs/CIA-SSD/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/home/ubuntu-502/xu/CIA-SSD/envs/CIA-SSD/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/ubuntu-502/xu/CIA-SSD/envs/CIA-SSD/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/home/ubuntu-502/xu/CIA-SSD/envs/CIA-SSD/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

Has anyone else run into this problem, or can anyone help me track down this error? Thank you very much!
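For reference, the traceback suggests that when DDP tries to replicate the model, its parameters are not all on the device it treats as devices[0] for this process. A common remedy is to bind each process to its own GPU before moving the model and to restrict the wrapper to that single device. Below is a minimal sketch using plain torch.nn.parallel.DistributedDataParallel (MMDistributedDataParallel wraps it), not the SA-SSD code itself; build_model() and the --local_rank argument are assumptions standing in for the project's own model construction and launcher arguments.

    import argparse

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # supplied by the launcher
    args = parser.parse_args()

    # Bind this process to its own GPU *before* building/moving the model,
    # so every parameter ends up on that single device.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    model = build_model()   # hypothetical model constructor
    model = model.cuda()    # now lands on cuda:<local_rank>

    # Restrict DDP to the current device instead of letting it replicate
    # across all visible GPUs (which requires all params on devices[0]).
    model = DistributedDataParallel(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        find_unused_parameters=True,
    )

This is only a sketch under those assumptions; the key points are calling torch.cuda.set_device per rank and passing device_ids/output_device so the wrapper does not try to broadcast parameters across devices.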

vehxianfish · Apr 18 '21 03:04