inception_v3 from torchvision 0.3.0 does not work with DataParallel in torch 1.1.0
Environment: Python 3.5, torch 1.1.0, torchvision 0.3.0
Reproducible example:
```python
import torch
import torchvision

model = torchvision.models.inception_v3().cuda()
model = torch.nn.DataParallel(model, [0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
model(x)  # raises TypeError during gather
```
Error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: __new__() missing 1 required positional argument: 'aux_logits'
```
I guess the error occurs because the output of inception_v3 was changed from a tuple to a namedtuple.
Yes, that's probably the reason.
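To see why, here is a minimal sketch of the failure, assuming `gather_map` rebuilds containers the way torch 1.1.0's `scatter_gather.py` does (the names below are illustrative, not the real code path):

```python
from collections import namedtuple

# Mimics torchvision's InceptionOutputs namedtuple.
InceptionOutputs = namedtuple('InceptionOutputs', ['logits', 'aux_logits'])
outputs = [InceptionOutputs(1, 2), InceptionOutputs(3, 4)]  # one result per GPU
out = outputs[0]

# gather_map rebuilds containers with type(out)(map(...)); for a namedtuple,
# __new__ then receives the whole map object as 'logits' and nothing for
# 'aux_logits':
type(out)(map(sum, zip(*outputs)))
# TypeError: __new__() missing 1 required positional argument: 'aux_logits'
```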
I believe we have three options:
- remove the `namedtuple` and use a plain `tuple`, as before, basically reverting some of the changes in https://github.com/pytorch/vision/pull/828
- fix PyTorch's `DataParallel` to support `namedtuple`: https://github.com/pytorch/pytorch/blob/c8b5f1d2f8f31781e664917f132af31a9abf9cbd/torch/nn/parallel/scatter_gather.py#L5-L31
- encourage the use of `DistributedDataParallel` instead, and do nothing
I'd vote for option number 2.
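For context, a minimal sketch of what option 2 could look like: detect namedtuples and rebuild them with unpacked arguments (illustrative only; `_is_namedtuple` and the simplified `gather_map` below are assumptions, not the actual PyTorch patch):

```python
from collections import namedtuple

def _is_namedtuple(obj):
    # Heuristic: namedtuples are tuple subclasses with a _fields attribute.
    return isinstance(obj, tuple) and hasattr(obj, '_fields')

def gather_map(outputs):
    out = outputs[0]
    if _is_namedtuple(out):
        # Unpack with * so each gathered field lands in its named position.
        return type(out)(*map(gather_map, zip(*outputs)))
    if isinstance(out, (list, tuple)):
        return type(out)(map(gather_map, zip(*outputs)))
    return sum(outputs)  # stand-in for Gather.apply on real tensors

Point = namedtuple('Point', ['x', 'y'])
print(gather_map([Point(1, 2), Point(3, 4)]))  # Point(x=4, y=6)
```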
cc @TheCodez and @Separius, who commented on / originally sent the aforementioned PR. What are your thoughts here?
@fmassa I agree, option 2 would be the best way to avoid problems in the future.
@fmassa yeah second option makes the most sense
The problem still seems to be there.
I've made a little workaround that sidesteps this unsupported-namedtuple problem.
It's a kind of mix of @fmassa's options 1 and 2: it doesn't change inception_v3 in torchvision.models, but converts the namedtuple to a dict in the parallel gather step.
Change the gather function in the scatter_gather.py file to the following:
```python
# Drop-in replacement for gather() in torch/nn/parallel/scatter_gather.py;
# it relies on that module's existing imports (torch, Gather).
def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
    (-1 means the CPU).
    """
    def gather_map(outputs):
        def isnamedtupleinstance(x):
            t = type(x)
            b = t.__bases__
            if len(b) != 1 or b[0] != tuple:
                return False
            f = getattr(t, '_fields', None)
            if not isinstance(f, tuple):
                return False
            return all(type(n) == str for n in f)

        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        # Convert namedtuples (e.g. InceptionOutputs) to plain dicts so the
        # dict branch below can rebuild them without positional arguments.
        if isnamedtupleinstance(out):
            outputs = [dict(out._asdict()) for out in outputs]
            out = outputs[0]
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res
```
You can then get the result of the inception_v3 model like this:

```python
outputs, aux_outputs = self.model(imgs).values()
```

Don't forget to add `.values()` at the end.
I know this is not the best solution, but I hope it helps someone for now.
I tried out your solution @YongWookHa, however now I am getting an error when calculating the loss function. Error:

```
  File "/home/min/a/ghosh37/distiller/distiller/apputils/image_classifier.py", line 588, in train
    loss = criterion(output, target)
  File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'dict_values' object has no attribute 'log_softmax'
```
EDIT: I figured out the problem. It was an issue with the dict.
I think you forgot to add .values() when you get outputs from your inception model.
So, have you solved the problem?
Yes, I did add .values(), but I was assigning the result to a single output variable instead of unpacking it into outputs, aux_outputs, so the loss function was computed on the dict_values object instead of a tensor, and I got the error.
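To illustrate the mistake, a minimal sketch using a plain dict in place of the patched model output (the names are placeholders, not the actual training code):

```python
# Stand-in for the dict returned by the patched gather().
out = {'logits': 1.0, 'aux_logits': 2.0}

# Wrong: a single name captures the whole dict_values view, which is
# what the loss function then chokes on.
output = out.values()
print(type(output))  # <class 'dict_values'>

# Right: unpack the view into the two values.
logits, aux_logits = out.values()
print(logits, aux_logits)  # 1.0 2.0
```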
Thanks, your method saved me hours of training time. Earlier, I could train inception on only a single GPU; now, after modifying the PyTorch file with your code, I am able to train on more than one GPU.
I tried out your solution @YongWookHa, but got an error as shown below:

```
train Loss: 0.9664 Acc: 0.5738
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch0721/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "
```
Could you please give me some suggestions?
Edit: fixed. As there is no need to use the aux classifiers for inference, I changed the code to:
```python
if phase == 'train':
    outputs, aux_outputs = model(inputs).values()
    loss1 = criterion(outputs, labels)
    loss2 = criterion(aux_outputs, labels)
    loss = loss1 + 0.4 * loss2
else:
    outputs = model(inputs)
    loss = criterion(outputs, labels)
```
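This works because in eval mode inception_v3 skips the auxiliary classifier and returns a plain tensor rather than the namedtuple, so the else branch needs no .values(). (The 0.4 weight on the auxiliary loss follows the common Inception fine-tuning recipe.)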
Thanks!
I used APEX amp with inception_v3 and got the same problem:
- APEX 0.1
- torch 1.13
- torchvision 0.13
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/apex/amp/_initialize.py", line 198, in new_fwd
return applier(output, output_caster)
File "/opt/conda/lib/python3.8/site-packages/apex/amp/_initialize.py", line 51, in applier
return type(value)(applier(v, fn) for v in value)
TypeError: __new__() missing 1 required positional argument: 'aux_logits'
To solve this problem, I replaced the namedtuple with a function returning a plain tuple, and it works:

```python
torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)
```
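A minimal usage sketch, assuming the patch only needs to be applied before the forward pass (the module-level name is looked up at call time):

```python
import torch
import torchvision

# Replace the namedtuple with a plain-tuple factory; apex/amp (and
# DataParallel's gather) can then rebuild the output positionally.
torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)

model = torchvision.models.inception_v3()
model.train()
x = torch.rand(2, 3, 299, 299)
out, aux = model(x)  # a plain (logits, aux_logits) tuple in training mode
```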