pretrained-models.pytorch

How to use multi-GPU with PyTorch 1.0?

Open cizhenshi opened this issue 5 years ago • 11 comments

When I use DataParallel, I get the error "Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0". Why?

cizhenshi avatar Jan 16 '19 10:01 cizhenshi

I confirm.

When I use 2 GPUs and call resnet50 from torchvision => it works well.

When I call it from Cadene's pretrained models => RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

After the model is initialized, I apply: model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

ping @Cadene

ternaus avatar Jan 26 '19 00:01 ternaus

A wrapper like this:

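import torch.nn as nn

class Net(nn.Module):
  def __init__(self, model):
    super().__init__()
    # Everything except the final layer, pinned to the first GPU.
    self.l1 = nn.Sequential(*list(model.children())[:-1]).to('cuda:0')
    # The final (classifier) layer, left on its current device.
    self.last = list(model.children())[-1]

  def forward(self, x):
    x = self.l1(x)
    x = x.view(x.size(0), -1)  # flatten to (batch, features)
    x = self.last(x)
    return x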

partially solves the problem, but I would prefer it to work without this hack.

ternaus avatar Feb 24 '19 23:02 ternaus

@Cadene

ternaus avatar Feb 24 '19 23:02 ternaus

I encountered this problem after loading the pretrained inception-v3 model.

        import torch
        import torch.nn as nn
        import pretrainedmodels

        model = pretrainedmodels.__dict__['inceptionv3'](num_classes=1000, pretrained='imagenet')
        # Replace the classifier head with a 2-class linear layer.
        model._modules['last_linear'] = nn.Linear(in_features=2048, out_features=2, bias=True)
        model = nn.Sequential(model)

I tried wrapping the model, but this time I hit an out-of-GPU-memory error. I'm using two 12 GB Titan Xp cards.

Traceback (most recent call last):
  File "prunner_v3.py", line 317, in <module>
    fine_tuner.train(epoches = 10)
  File "prunner_v3.py", line 183, in train
    self.train_epoch(optimizer)
  File "prunner_v3.py", line 203, in train_epoch
    self.train_batch(optimizer, batch.cuda(), label.cuda(), rank_filters)
  File "prunner_v3.py", line 197, in train_batch
    self.criterion(self.model(input)[0], Variable(label)).backward()
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "prunner_v3.py", line 279, in forward
    x = self.l1(x)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torchvision/models/inception.py", line 213, in forward
    branch7x7 = self.branch7x7_2(branch7x7)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torchvision/models/inception.py", line 334, in forward
    x = self.bn(x)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1623, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 50.62 MiB (GPU 0; 11.91 GiB total capacity; 11.11 GiB already allocated; 37.06 MiB free; 39.15 MiB cached)

oscarriddle avatar Mar 12 '19 03:03 oscarriddle

See https://github.com/pytorch/pytorch/issues/8637 for a discussion on why this happens and how to resolve it.

willprice avatar Jun 28 '19 12:06 willprice
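One plausible mechanism behind the device-mismatch error can be sketched in plain Python. The sketch assumes (this thread does not confirm it) that the pretrained models attach `forward` to each model instance with `types.MethodType`, and it approximates a DataParallel replica with `copy.copy`. `TinyModule`, `FixedModule`, `_forward_impl`, and `weight_device` are hypothetical names used only for illustration; no GPU or PyTorch install is needed to run it.

```python
import copy
import types


class TinyModule:
    """Stand-in for nn.Module; `weight_device` mimics where the weights live."""

    def __init__(self, weight_device):
        self.weight_device = weight_device


def _forward_impl(self, input_device):
    # Mimics a layer that fails when input and weight devices differ.
    if input_device != self.weight_device:
        raise RuntimeError(
            f"device {input_device} does not equal {self.weight_device}")
    return "ok"


original = TinyModule(weight_device=0)
# Assumed pattern: forward is attached to the instance, not to the class.
original.forward = types.MethodType(_forward_impl, original)

# A shallow copy stands in for the replica whose weights live on device 1.
replica = copy.copy(original)
replica.weight_device = 1

# The copied bound method still points at the original instance,
# so the replica would run device-0 weights on device-1 input.
stale_binding = replica.forward.__self__ is original


class FixedModule(TinyModule):
    # Defining forward on the class lets each copy bind its own `self`.
    forward = _forward_impl


fixed_replica = copy.copy(FixedModule(weight_device=0))
fixed_replica.weight_device = 1
```

If this is indeed the cause, calling `replica.forward(1)` raises an error shaped like the one reported above, while `fixed_replica.forward(1)` succeeds because the method is looked up on the class and bound to the copy.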

https://github.com/Cadene/pretrained-models.pytorch/pull/145

hegc avatar Aug 06 '19 03:08 hegc

@ternaus In my case, when I set batch_size=36, I get the error RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution).

But when I set batch_size=32, the error disappears.

DonghoonPark12 avatar Oct 06 '19 08:10 DonghoonPark12

When I use 2 GPUs => it works well. Now I want to use only one GPU, and a problem occurs.

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

It is the same error. How do I deal with it? Thank you.

Gavin-Evans avatar Oct 10 '20 03:10 Gavin-Evans
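For the single-GPU case, one common way to rule out a device-index mismatch is to expose only one GPU to the process, so that it always appears as `cuda:0`. This is a minimal sketch, not a confirmed fix for the error above: it assumes physical GPU 1 is the one to use, and `pick_device` is a hypothetical helper. The environment variable must be set before CUDA is initialized, so it is set here before `torch` is imported.

```python
import os

# Assumption: physical GPU 1 is the card to use. It will be visible as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

try:
    import torch  # imported after the environment variable is set
except ImportError:
    torch = None  # lets the sketch run on machines without PyTorch


def pick_device():
    """Return the only visible GPU as 'cuda:0', or 'cpu' if none is usable."""
    if torch is not None and torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


device = pick_device()
```

With the device chosen this way, both the model and every input batch can be moved with `.to(device)`, which keeps them on the same device index.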

I solved all the issues when I moved to https://github.com/rwightman/pytorch-image-models

ternaus avatar Oct 10 '20 03:10 ternaus

I solved all the issues when I moved to https://github.com/rwightman/pytorch-image-models

Thank you for your idea; it works now.

Gavin-Evans avatar Oct 10 '20 07:10 Gavin-Evans

Does anyone know how to work around this problem? Or should I rebuild my model using torchvision's pretrained model?

viet2411 avatar Apr 16 '21 05:04 viet2411