
inception_v3 of vision 0.3.0 does not fit in DataParallel of torch 1.1.0

Open QizhongYao opened this issue 6 years ago • 9 comments

Environment: Python 3.5, torch 1.1.0, torchvision 0.3.0

Reproducible example:

import torch
import torchvision

model = torchvision.models.inception_v3().cuda()
model = torch.nn.DataParallel(model, [0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
model.forward(x)

Error:

Traceback (most recent call last):
  File "", line 1, in
  File "env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: __new__() missing 1 required positional argument: 'aux_logits'

I guess the error occurs because the output of inception_v3 was changed from tuple to namedtuple.
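For illustration, here is a minimal sketch (no GPUs needed) of why gather breaks on a namedtuple: the generic reconstruction type(out)(iterable) fills only the first field.

from collections import namedtuple

# Stand-in for torchvision's InceptionOutputs namedtuple
InceptionOutputs = namedtuple('InceptionOutputs', ['logits', 'aux_logits'])

outputs = [InceptionOutputs(1, 2), InceptionOutputs(3, 4)]
out = outputs[0]

# DataParallel's gather_map rebuilds containers with
# type(out)(map(gather_map, zip(*outputs))); for a namedtuple this
# passes a single map object as the first field:
type(out)(map(lambda field: field, zip(*outputs)))
# TypeError: __new__() missing 1 required positional argument: 'aux_logits'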

QizhongYao avatar Jun 25 '19 00:06 QizhongYao

Yes, that's probably the reason.

I believe we have three options:

  1. remove namedtuple and use tuple, as before, basically reverting some of the changes in https://github.com/pytorch/vision/pull/828
  2. fix PyTorch DataParallel to support namedtuple (a possible fix is sketched after this list) https://github.com/pytorch/pytorch/blob/c8b5f1d2f8f31781e664917f132af31a9abf9cbd/torch/nn/parallel/scatter_gather.py#L5-L31
  3. encourage the use of DistributedDataParallel instead, and do nothing.
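For illustration, a minimal sketch of what option 2 could look like inside torch/nn/parallel/scatter_gather.py (a hypothetical patch, not the actual upstream change): detect namedtuples and rebuild them by expanding the gathered fields positionally rather than passing a single iterable.

def is_namedtuple(obj):
    # Heuristic: namedtuples are tuple subclasses with a _fields attribute
    return isinstance(obj, tuple) and hasattr(obj, '_fields')

# Inside gather_map, before the generic sequence branch:
#     if is_namedtuple(out):
#         return type(out)(*map(gather_map, zip(*outputs)))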

I'd vote for option number 2.

ccing @TheCodez and @Separius, who commented on / sent the aforementioned PR initially. What are your thoughts here?

fmassa avatar Jun 25 '19 16:06 fmassa

@fmassa I agree option 2 would be the best to avoid problems in the future

TheCodez avatar Jun 25 '19 18:06 TheCodez

@fmassa yeah second option makes the most sense

Separius avatar Jun 25 '19 19:06 Separius

The problem still seems to be there. I've come up with a little trick to work around the unsupported-namedtuple problem.

It's a mix of @fmassa's options 1 and 2: it doesn't change inception_v3 in torchvision.models, but converts the namedtuple to a dict in the parallel parts.

Change the gather function in the scatter_gather.py file to the below.

def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
      (-1 means the CPU).
    """
    def gather_map(outputs):
        def isnamedtupleinstance(x):
            # Heuristic namedtuple check: a direct tuple subclass whose
            # _fields attribute is a tuple of strings.
            t = type(x)
            b = t.__bases__
            if len(b) != 1 or b[0] != tuple:
                return False
            f = getattr(t, '_fields', None)
            if not isinstance(f, tuple):
                return False
            return all(type(n) == str for n in f)

        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None

        # Convert namedtuples to plain dicts so they take the dict branch
        # below, instead of the generic tuple branch at the bottom, which
        # would call the namedtuple constructor with a single iterable.
        if isnamedtupleinstance(out):
            outputs = [dict(out._asdict()) for out in outputs]
            out = outputs[0]

        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))

        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res

You can then get the results of the inception_v3 model as below.

outputs, aux_outputs = self.model(imgs).values()

Don't forget to add .values() at the end.

I know this is not the best solution. But I just hope this could help someone for now.

YongWookHa avatar Oct 31 '19 09:10 YongWookHa

I tried out your solution @YongWookHa, however, now I am getting an error when calculating the loss function. Error:

File "/home/min/a/ghosh37/distiller/distiller/apputils/image_classifier.py", line 588, in train loss = criterion(output, target) File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__ result = self.forward(*input, **kwargs) File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 916, in forward ignore_index=self.ignore_index, reduction=self.reduction) File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 2009, in cross_entropy return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction) File "/home/min/a/ghosh37/distiller/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1317, in log_softmax ret = input.log_softmax(dim) AttributeError: 'dict_values' object has no attribute 'log_softmax'

EDIT: I figured out the problem. It was an issue with the dict.

soumendukrg avatar Nov 15 '19 18:11 soumendukrg

> I tried out your solution @YongWookHa, however, now I am getting an error when calculating the loss function [...] EDIT: I figured out the problem. It was an issue with the dict.

I think you forgot to add .values() when you get outputs from your inception model. So, have you solved the problem?

YongWookHa avatar Nov 18 '19 08:11 YongWookHa

Yes, I did add .values(), but I was assigning model(...).values() to a single output variable instead of unpacking it into output, aux_output, so the loss function was computed on a dict_values object instead of a tensor, which caused the error.
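To illustrate the mistake (a minimal sketch; variable names follow the snippets above):

# Wrong: assigning .values() to a single variable leaves a dict_values
# object, which has no tensor methods such as log_softmax
output = model(imgs).values()
loss = criterion(output, target)   # AttributeError: 'dict_values' object has no attribute 'log_softmax'

# Right: unpack both dict entries
outputs, aux_outputs = model(imgs).values()
loss = criterion(outputs, target)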

Thanks, your method saved me hours of training time. Earlier, I could train Inception on only a single GPU; now, after modifying the PyTorch file with your code, I am able to train on more than one GPU.

soumendukrg avatar Nov 19 '19 21:11 soumendukrg

I tried out your solution @YongWookHa, but got an error as shown below:

train Loss: 0.9664 Acc: 0.5738

Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch0721/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 153, in
    num_epochs=25, is_inception=True)
  File "", line 91, in train_model
    outputs, aux_outputs = model(inputs).values()
RuntimeError: Could not run 'aten::values' with arguments from the 'CUDA' backend. 'aten::values' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].

Could you please give me some suggestions?

Edit: fixed. As there is no need to use the aux classifiers for inference, I changed the code to:

if phase == 'train':
    # in training mode the patched gather returns a dict, so unpack it
    outputs, aux_outputs = model(inputs).values()
    loss1 = criterion(outputs, labels)
    loss2 = criterion(aux_outputs, labels)
    loss = loss1 + 0.4 * loss2
else:
    # in eval mode the model returns a plain tensor, so no .values()
    outputs = model(inputs)
    loss = criterion(outputs, labels)

Thanks!

sanka4rea avatar Aug 06 '20 09:08 sanka4rea

I used apex.amp with inception_v3 and got the same problem:

  • APEX 0.1
  • torch 1.13
  • torchvision 0.13

File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/apex/amp/_initialize.py", line 198, in new_fwd
    return applier(output, output_caster)
File "/opt/conda/lib/python3.8/site-packages/apex/amp/_initialize.py", line 51, in applier
    return type(value)(applier(v, fn) for v in value)
TypeError: __new__() missing 1 required positional argument: 'aux_logits'

To solve this problem, I replaced the namedtuple with a function returning a plain tuple, and it works:

torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)
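A minimal sketch of how the monkey-patch might be applied (an assumption on my part: it must run before the model's forward is called, and it changes the train-mode return type from InceptionOutputs to a plain tuple, which gather and apex's applier can rebuild field by field):

import torchvision

# Replace the namedtuple class with a plain-tuple factory, so generic
# type(out)(...) reconstruction no longer requires named fields
torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)

model = torchvision.models.inception_v3().train()
# out, aux = model(x)   # now a plain tuple while in training mode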

QiangZiBro avatar Aug 17 '22 07:08 QiangZiBro