pretrained-models.pytorch
How to use multi-GPU with PyTorch 1.0?
When I use DataParallel, I encounter the error "Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0". Why?
I confirm.
When I use 2 GPUs and call resnet50 from torchvision => it works well.
Calling it from the Cadene pretrained models => RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
After the model is initialized I apply:
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
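A minimal sketch of the comparison described above, assuming two visible GPUs; the dummy batch shape and batch size are my assumptions, not from the report:

import torch
import torchvision
import pretrainedmodels

# torchvision's resnet50 wrapped in DataParallel runs fine on 2 GPUs:
tv_model = torchvision.models.resnet50(pretrained=True)
tv_model = torch.nn.DataParallel(tv_model, device_ids=[0, 1]).cuda()
tv_out = tv_model(torch.randn(4, 3, 224, 224).cuda())

# the same wrapping of the Cadene resnet50 raises the cudnn_convolution device error:
cad_model = pretrainedmodels.__dict__['resnet50'](num_classes=1000, pretrained='imagenet')
cad_model = torch.nn.DataParallel(cad_model, device_ids=[0, 1]).cuda()
cad_out = cad_model(torch.randn(4, 3, 224, 224).cuda())  # RuntimeError reported above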
ping @Cadene
A wrapper like the following:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, model):
        super(Net, self).__init__()
        # everything except the final classifier, pinned to the first GPU
        self.l1 = nn.Sequential(*list(model.children())[:-1]).to('cuda:0')
        # keep the original classifier layer as the head
        self.last = list(model.children())[-1]

    def forward(self, x):
        x = self.l1(x)
        x = x.view(x.size(0), -1)
        x = self.last(x)
        return x
This partially solves the problem, but I would prefer that it work without this hack.
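Hypothetical usage of the wrapper above (the base model choice and batch shape are my assumptions, not from the thread):

import torch
import pretrainedmodels

base = pretrainedmodels.__dict__['resnet50'](num_classes=1000, pretrained='imagenet')
net = torch.nn.DataParallel(Net(base), device_ids=[0, 1]).cuda()
out = net(torch.randn(8, 3, 224, 224).cuda())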
@Cadene
I encountered this problem after loading the pretrained inception-v3 model.
model = pretrainedmodels.__dict__['inceptionv3'](num_classes=1000, pretrained='imagenet')
model._modules['last_linear'] = nn.Linear(in_features=2048, out_features=2, bias=True)
model = nn.Sequential(model)
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
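For reference, a sketch of the same head replacement that reads the feature width from the existing layer instead of hard-coding 2048 (my rewrite, not code from the thread):

import torch
import torch.nn as nn
import pretrainedmodels

model = pretrainedmodels.__dict__['inceptionv3'](num_classes=1000, pretrained='imagenet')
model.last_linear = nn.Linear(model.last_linear.in_features, 2)  # 2-class head
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()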
I tried wrapping the model, and this time I encountered an out-of-memory error on the GPU. I'm using two 12GB Titan XPs.
Traceback (most recent call last):
File "prunner_v3.py", line 317, in <module>
fine_tuner.train(epoches = 10)
File "prunner_v3.py", line 183, in train
self.train_epoch(optimizer)
File "prunner_v3.py", line 203, in train_epoch
self.train_batch(optimizer, batch.cuda(), label.cuda(), rank_filters)
File "prunner_v3.py", line 197, in train_batch
self.criterion(self.model(input)[0], Variable(label)).backward()
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "prunner_v3.py", line 279, in forward
x = self.l1(x)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torchvision/models/inception.py", line 213, in forward
branch7x7 = self.branch7x7_2(branch7x7)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torchvision/models/inception.py", line 334, in forward
x = self.bn(x)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
exponential_average_factor, self.eps)
File "/home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1623, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 50.62 MiB (GPU 0; 11.91 GiB total capacity; 11.11 GiB already allocated; 37.06 MiB free; 39.15 MiB cached)
See https://github.com/pytorch/pytorch/issues/8637 for a discussion on why this happens and how to resolve it.
https://github.com/Cadene/pretrained-models.pytorch/pull/145
@ternaus
In my case, when I set batch_size=36, I get the error RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution). But when I set batch_size=32, the error disappears.
When I use 2 GPUs it works well. Now I want to use only one GPU, and I get this error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)
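A minimal single-GPU sketch (an assumption on my side, not a confirmed fix from this thread): keep the model and the inputs on the same device and skip DataParallel entirely.

import torch
import pretrainedmodels

device = torch.device('cuda:0')
model = pretrainedmodels.__dict__['inceptionv3'](num_classes=1000, pretrained='imagenet')
model = model.to(device).eval()                 # no DataParallel for a single GPU
x = torch.randn(2, 3, 299, 299, device=device)  # dummy batch on the same device
with torch.no_grad():
    out = model(x)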
I get the same error. How do I deal with it? Thank you.
I solved all the issues when I moved to https://github.com/rwightman/pytorch-image-models
Thank you for your idea, it works now.
Does anyone know how to work around this problem? Or should I rebuild my model using torchvision's pretrained model?
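For reference, a sketch of the torchvision route mentioned in the question, assuming the same 2-class inception_v3 setup as the earlier snippet (my assumption, not a confirmed fix):

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.inception_v3(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # 2-class head, as in the earlier snippet
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
model.eval()                                    # in train() mode inception_v3 also returns aux logits
out = model(torch.randn(4, 3, 299, 299).cuda())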