EfficientUnet-PyTorch
How to use it with multiple GPUs
Thank you for sharing! When I run with a single GPU, it runs well, but when I run with multiple GPUs, I get this error:
RuntimeError: Function CatBackward returned an invalid gradient at index 1 - expected device cuda:1 but got cuda:0
Could you give some advice on this error?
@Hesene Hello Hesene, in my lab I only have a single 2080 Ti, so I cannot reproduce this issue. I'm sorry about that!
OK, thank you for your code, it helped me a lot.
I'm facing the same problem. Which part is the cause?
Did you use torch.nn.DataParallel()?
No, I didn't, but I think it may work.
I'm not sure, but I think you can try to integrate nn.DataParallel() into the source code.
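If it helps, wrapping the model is only a couple of lines. A minimal sketch, assuming the get_efficientunet_b0 factory from this repo's README (adjust the arguments to your setup):

```python
import torch
import torch.nn as nn
from efficientunet import get_efficientunet_b0  # factory from this repo's README

model = get_efficientunet_b0(out_channels=2, concat_input=True, pretrained=False)
model = nn.DataParallel(model).cuda()  # replicates the model on every visible GPU

x = torch.randn(8, 3, 224, 224).cuda()  # DataParallel scatters the batch across GPUs
out = model(x)
```

Note that this is exactly the setup where the device-mismatch error later in this thread shows up, so it may not work out of the box here.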
I use EfficientNet as the backbone to train an object detection model, and nn.DataParallel() works fine; the only issue is that multi-GPU training is quite slow.
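That slowdown is a known cost of nn.DataParallel: it runs in a single process and replicates the model and scatters/gathers tensors on every step. torch.nn.parallel.DistributedDataParallel, with one process per GPU, is generally faster. A minimal sketch, assuming you launch with torchrun --nproc_per_node=<num_gpus> train.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from efficientunet import get_efficientunet_b0  # same factory as above

dist.init_process_group(backend="nccl")  # torchrun provides the rendezvous env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = get_efficientunet_b0(out_channels=2, pretrained=False).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # one full replica per process

# In the training loop, pair this with torch.utils.data.DistributedSampler
# so each rank sees a distinct shard of the dataset.
```

Since each process holds its own complete copy of the model, DDP should also sidestep the cross-device state sharing suspected later in this thread.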
I'm seeing a similar issue when running with nn.DataParallel:
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/efficientunet/efficientunet.py", line 106, in forward
x = torch.cat([x, blocks.popitem()[1]], dim=1)
RuntimeError: All input tensors must be on the same device. Received cuda:0 and cuda:1
Any ideas?
Thanks!
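The failing frame is easy to reproduce in isolation: torch.cat refuses inputs that live on different devices. A toy demo (assumes a machine with at least two GPUs):

```python
import torch

a = torch.randn(1, 8, 32, 32, device="cuda:0")  # decoder feature map
b = torch.randn(1, 8, 32, 32, device="cuda:1")  # cached encoder block on the wrong device
torch.cat([a, b], dim=1)  # RuntimeError: All input tensors must be on the same device

# A local workaround is torch.cat([a, b.to(a.device)], dim=1), but the real
# question is why the cached block ended up on another device at all.
```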
Hi, bro. Did you solve the problem?
I suspect this problem is due to a certain module in EfficientUnet being shared, which results in that module's state living on only one GPU; perhaps the encoder...
I agree, I'm now facing the same problem.
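A minimal illustration of that suspicion, using a hypothetical encoder rather than this repo's actual code: nn.DataParallel makes a shallow copy of non-tensor attributes when it replicates a module, so a plain dict attribute ends up shared by all replicas, and a skip feature written on cuda:1 can be read back by the replica on cuda:0. Returning the skips from forward keeps everything replica-local:

```python
import torch
import torch.nn as nn

class SharedStateEncoder(nn.Module):
    """Anti-pattern: skip features cached on the module itself."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)
        self.blocks = {}  # shallow-copied on replicate(): one dict shared by all replicas

    def forward(self, x):
        x = self.conv1(x)
        self.blocks["skip"] = x  # replicas on cuda:0 and cuda:1 race on this entry
        return self.conv2(x)

class ReplicaLocalEncoder(nn.Module):
    """Fix: hand the skip tensor back through forward instead."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)

    def forward(self, x):
        skip = self.conv1(x)
        return self.conv2(skip), skip  # torch.cat in the decoder stays on one device
```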
@NPU-Franklin Franklin created a PR (#11) to support multiple GPUs. I don't have multiple cards, so I cannot test it, but maybe you can give it a try.