
How to use it with Multi GPU

Hesene opened this issue 6 years ago • 12 comments

Thank you for sharing!!! When I run with a single GPU it runs well, but when I run with multiple GPUs I get the error RuntimeError: Function CatBackward returned an invalid gradient at index 1 - expected device cuda:1 but got cuda:0. Could you give some advice on this error?

Hesene avatar Aug 10 '19 03:08 Hesene

@Hesene Hello Hesene, in my lab I only have a single 2080 Ti, so I cannot reproduce this issue. I'm sorry about that!

zhoudaxia233 avatar Aug 13 '19 15:08 zhoudaxia233

> @Hesene Hello Hesene, in my lab I only have a single 2080 Ti, so I cannot reproduce this issue. I'm sorry about that!

Ok, thank you for your code, it helped me a lot.

Hesene avatar Aug 13 '19 15:08 Hesene

I face the same problem. Which part is the cause?

AtsunoriFujita avatar Oct 15 '19 20:10 AtsunoriFujita

did you use torch.nn.DataParallel()?

goodgoodstudy92 avatar Jan 17 '20 11:01 goodgoodstudy92

> did you use torch.nn.DataParallel()?

No, I didn't, but I think it may work.

zhoudaxia233 avatar Jan 18 '20 04:01 zhoudaxia233

> I face the same problem. Which part is the cause?

I'm not sure, but you could try integrating nn.DataParallel() into the source code.

zhoudaxia233 avatar Jan 18 '20 04:01 zhoudaxia233
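For anyone landing here, the suggestion above amounts to wrapping the model before calling it. A minimal sketch with a hypothetical stand-in module (`TinyNet` is not part of this repo; the real constructors are `get_efficientunet_b0` and friends, but any `nn.Module` wraps the same way):

```python
import torch
from torch import nn

# TinyNet is a hypothetical stand-in for the U-Net; any nn.Module
# is wrapped by DataParallel in exactly the same way.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyNet().to(device)

# nn.DataParallel splits the batch along dim 0 and replicates the module onto
# every visible GPU each forward pass; with no GPUs it falls back to a plain call.
model = nn.DataParallel(model)

x = torch.randn(4, 3, 32, 32, device=device)
out = model(x)
print(tuple(out.shape))  # (4, 8, 32, 32)
```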

> I face the same problem. Which part is the cause?

> I'm not sure, but you could try integrating nn.DataParallel() into the source code.

I use EfficientNet as the backbone to train an object detection model, and nn.DataParallel() works fine; the only issue is that multi-GPU training is quite slow.

goodgoodstudy92 avatar Jan 18 '20 05:01 goodgoodstudy92
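On the speed point: `nn.DataParallel` re-replicates the module on every forward pass and gathers outputs and gradients through a single device, which often makes multi-GPU training slower than expected; PyTorch's documentation recommends `DistributedDataParallel` instead. A minimal single-process sketch of the wrapping pattern (the CPU `gloo` backend is used here only so it runs anywhere; in practice you launch one process per GPU with `torchrun`):

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Normally torchrun sets up one process per GPU; this single-process
# "gloo" group on CPU only demonstrates the wrapping pattern.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29507")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stand-in for the U-Net
ddp_model = DDP(model)  # gradients are all-reduced across ranks during backward

out = ddp_model(torch.randn(2, 3, 16, 16))
out.sum().backward()
print(tuple(out.shape))  # (2, 8, 16, 16)

dist.destroy_process_group()
```

Unlike `DataParallel`, each process keeps its own model replica alive between iterations, so there is no per-step replication cost.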

I'm seeing a similar issue when running with nn.DataParallel:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/efficientunet/efficientunet.py", line 106, in forward
    x = torch.cat([x, blocks.popitem()[1]], dim=1)
RuntimeError: All input tensors must be on the same device. Received cuda:0 and cuda:1

Any ideas?

Thanks!

ryanstout avatar Apr 25 '20 18:04 ryanstout
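The traceback points at `torch.cat([x, blocks.popitem()[1]], dim=1)`, where `blocks` holds the encoder's skip features. One plausible cause (echoed later in this thread) is that the container collecting those features ends up shared across `DataParallel` replicas, so a replica on `cuda:1` can pop a tensor produced on `cuda:0`. A replica-safe pattern keeps the skip dict local to each `forward` call; a toy sketch (not the repo's actual code):

```python
from collections import OrderedDict

import torch
from torch import nn

class TinyUnet(nn.Module):
    """Toy U-Net-like model; skip features live in a local dict, never on self."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 8, 3, padding=1)
        self.enc2 = nn.Conv2d(8, 16, 3, padding=1)
        self.dec = nn.Conv2d(16 + 8, 8, 3, padding=1)

    def forward(self, x):
        blocks = OrderedDict()  # local: each replica gets its own container
        x = self.enc1(x)
        blocks["enc1"] = x      # tensors here stay on this replica's device
        x = self.enc2(x)
        # pop the most recent skip and concatenate along channels,
        # but from a per-call container rather than shared state
        x = torch.cat([x, blocks.popitem()[1]], dim=1)
        return self.dec(x)

model = nn.DataParallel(TinyUnet())
out = model(torch.randn(4, 3, 32, 32))
print(tuple(out.shape))  # (4, 8, 32, 32)
```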

> I'm seeing a similar issue when running with nn.DataParallel:
>
> RuntimeError: Caught RuntimeError in replica 0 on device 0.
> Original Traceback (most recent call last):
>   File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
>     output = module(*input, **kwargs)
>   File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
>     result = self.forward(*input, **kwargs)
>   File "/home/ryanstout/.local/share/virtualenvs/arsenal_train2-TlJZ47AR/lib/python3.7/site-packages/efficientunet/efficientunet.py", line 106, in forward
>     x = torch.cat([x, blocks.popitem()[1]], dim=1)
> RuntimeError: All input tensors must be on the same device. Received cuda:0 and cuda:1
>
> Any ideas?
>
> Thanks!

Hi, bro. Have you solved the problem?

Vipermdl avatar Sep 22 '20 02:09 Vipermdl

I suspect this problem is due to the sharing of some module in EfficientUnet, which leaves that module on only one GPU, perhaps the encoder…

If-only1 avatar Nov 08 '20 12:11 If-only1
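One way this kind of sharing can happen in practice: features collected by a forward hook into a plain dict attribute. `DataParallel` replicas are created with a shallow copy of the module's attributes, and a hook closure keeps pointing at the original module, so every replica can end up writing into one shared dict. A toy reproduction of the hazardous pattern (`HookedEncoder` is hypothetical, not this repo's code), runnable on CPU:

```python
import torch
from torch import nn

class HookedEncoder(nn.Module):
    """The hazard pattern: features stashed via a hook into a dict on self.

    Under nn.DataParallel, tensors written into such a dict can come from
    different replicas, i.e. different devices.
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.features = {}  # shared mutable state -- the suspect
        self.conv.register_forward_hook(
            lambda mod, inp, out: self.features.update({"conv": out})
        )

    def forward(self, x):
        return self.conv(x)

enc = HookedEncoder()
out = enc(torch.randn(1, 3, 8, 8))
# The hook stashed the feature map on the module itself:
print("conv" in enc.features)  # True
```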

> I suspect this problem is due to the sharing of some module in EfficientUnet, which leaves that module on only one GPU, perhaps the encoder…

I agree, I'm now facing the same problem.

TianyiFranklinWang avatar Mar 13 '21 07:03 TianyiFranklinWang

@NPU-Franklin Franklin created a PR (#11) to support multiple GPUs. I don't have multiple cards, so I cannot test it, but maybe you can give it a try.

zhoudaxia233 avatar Apr 20 '21 09:04 zhoudaxia233