Multiple independent models, only one requires apex.amp, crash in non-amp CPU model

Open lopuhin opened this issue 4 years ago • 13 comments

I have a use-case with a "main" model that is trained with apex.amp at opt_level "O1", and all is fine there. But I also have a small supplementary model that does not need mixed precision training and is trained on the CPU. When apex.amp is enabled, training the second model (after the first model has been trained) crashes with:

File "model.py"
  pred_logits = model(logits)
File "venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
  result = self.forward(*input, **kwargs)
File "model.py", in forward
  return self.linear(x)
File "venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
  result = self.forward(*input, **kwargs)
File "venv/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
  return F.linear(input, self.weight, self.bias)
File "venv/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
  return orig_fn(*new_args, **kwargs)
File "venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
  ret = torch.addmm(bias, input, weight.t())
File "venv/lib/python3.6/site-packages/apex/amp/wrap.py", line 21, in wrapper
  args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
File "venv/lib/python3.6/site-packages/apex/amp/utils.py", line 97, in cached_cast
  if cached_x.grad_fn.next_functions[1][0].variable is not x:
AttributeError: 'NoneType' object has no attribute 'next_functions'

This is happening with pytorch 1.3.1 and apex 2ca894da7be755711cbbdf56c74bb7904bfd8417 (latest master), and also happened with 82dac9c9419035110d1ccc49b2608681337903ed.

I'm not sure if this is a bug or me using apex.amp incorrectly - I see that the docs say that amp.initialize should be called only once (which is the case), but does this mean that all models to be used in the process must be passed? Is there a way around this? In this case the models are very unrelated and initializing them at once would be quite inconvenient.
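
To make the question concrete, here is a sketch of what the "pass everything at once" interpretation might look like, using the list form that the apex docs describe for amp.initialize (whether a CPU-only model can sensibly be passed here at all is part of what I'm asking):

import torch
from apex import amp
from torchvision.models import resnet34

# Sketch only: register both models and optimizers in a single amp.initialize call.
device = torch.device('cuda')
main_model = resnet34().to(device)
aux_model = resnet34()  # the small supplementary model, kept on CPU
main_opt = torch.optim.SGD(main_model.parameters(), lr=1e-2)
aux_opt = torch.optim.SGD(aux_model.parameters(), lr=1e-2)

# amp.initialize accepts lists of models/optimizers and returns lists in that case;
# it is unclear (to me) whether a CPU model belongs in this call at all.
models, optimizers = amp.initialize(
    [main_model, aux_model], [main_opt, aux_opt], opt_level='O1')
main_model, aux_model = models
main_opt, aux_opt = optimizers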

I also created a simple repro - it crashes, but if we remove amp initialization or move the second model to GPU, the crash does not happen:

import torch
from apex import amp
from torchvision.models import resnet34
from torch.optim import SGD

device = torch.device('cuda')
model = resnet34()
optimizer = SGD(model.parameters(), lr=1e-2)
model.to(device)

use_amp = True
if use_amp:
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model(torch.randn(1, 3, 224, 224).to(device))  # the GPU model under amp works fine

# a second, unrelated model kept on CPU and never passed to amp.initialize
another_model = resnet34()
output = another_model(torch.randn(1, 3, 224, 224))  # crashes here when use_amp is True
print(output.shape)

lopuhin avatar Jan 29 '20 08:01 lopuhin

Same problem in same case. Did you find a solution?

liehtman avatar Aug 24 '20 13:08 liehtman

We switched to the O2 opt level, which does not have this issue. Also, mixed precision training is natively supported in PyTorch since 1.6 (torch.cuda.amp), which solves the issue as well.
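
For anyone finding this later, a minimal sketch of the native path (assumes PyTorch >= 1.6 and torch.cuda.amp): autocast is scoped to a context manager instead of patching torch functions globally, so an unrelated CPU model is left alone.

import torch
import torch.nn.functional as F
from torchvision.models import resnet34

device = torch.device('cuda')
model = resnet34().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()  # handles loss scaling for fp16

x = torch.randn(1, 3, 224, 224, device=device)
target = torch.randint(0, 1000, (1,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # mixed precision only inside this block
    loss = F.cross_entropy(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# A second, unrelated CPU model is unaffected, since nothing is patched globally.
cpu_model = resnet34()
print(cpu_model(torch.randn(1, 3, 224, 224)).shape)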

lopuhin avatar Aug 24 '20 13:08 lopuhin

Same problem here, but we cannot use the O2 opt level because our model does not fully converge at that opt level.

zwithz avatar Mar 11 '21 07:03 zwithz

Did anyone solve this apex error?

tejan-rgb avatar Jul 16 '21 06:07 tejan-rgb

Did anyone solve this apex error?

I solved it by changing apex/amp/utils.py as follows.

# change this line (line 113)
- if cached_x.grad_fn.next_functions[1][0].variable is not x:
# into this
+ if cached_x.grad_fn.next_functions[0][0].variable is not x:

zwithz avatar Sep 14 '21 06:09 zwithz

Solved my problem following your advice, thanks @zwithz

mamunctg avatar Nov 28 '21 07:11 mamunctg

I got an error like

if cached_x.grad_fn.next_functions[0][0].variable is not x:
AttributeError: 'NoneType' object has no attribute 'variable'

It seems cached_x.grad_fn.next_functions[0][0] is None

classicsong avatar Feb 07 '22 03:02 classicsong

Be careful adding the fix that @zwithz mentioned. I'm pretty sure it messed up mixed-precision training for me. After removing the fix months later, everything is back to normal.

fijipants avatar Feb 17 '22 16:02 fijipants

Be careful adding the fix that @zwithz mentioned. I'm pretty sure it messed up mixed-precision training for me. After removing the fix months later, everything is back to normal.

Then, how did you solve that problem?

classicsong avatar Feb 17 '22 17:02 classicsong

Having this issue running https://github.com/SwinTransformer/Transformer-SSL on Swin-T with a 3090, using precompiled apex from pip install apex -f https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py37_cu113_pyt11/download.html and PyTorch from conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

The fix from the post did allow me to run the training. I haven't seen any drastic differences so far (fingers crossed); I hope this won't impact the training long term.

The change does seem to prevent the runtime error caused by x not being the parent of the cached x, and also covers the case where it is training and torch.is_grad_enabled() and x.requires_grad != cached_x.requires_grad holds.

Observing the values of cached_x.grad_fn.next_functions[0][0].variable and x on my end: in every case they seemed to be the same and looked somewhat like this:

 Parameter containing:
tensor([ 0.1293,  0.1166,  0.0126,  0.1159,  0.0701, -0.1036, -0.0761,  0.0027,
        -0.1023, -0.0475,  0.1164,  0.0672,  0.1257,  0.0011, -0.0736,  0.0955,
         0.0106,  0.0243, -0.0612,  0.0593, -0.1066,  0.1152,  0.1263,  0.0521,
         0.1124,  0.0876, -0.0551, -0.1252,  0.0190,  0.0906, -0.0148,  0.0121,
         0.1070,  0.0596,  0.1079,  0.0212,  0.0162, -0.0345, -0.0244, -0.0767,
         0.0965,  0.1316,  0.0536,  0.0041, -0.0476, -0.1425, -0.0267, -0.1025,
        -0.1066, -0.0286, -0.0284,  0.0291, -0.1046,  0.1037, -0.1314, -0.0684,
        -0.0548,  0.0089,  0.0597, -0.0380,  0.0225, -0.0342, -0.0568, -0.0202,
         0.0291, -0.1402, -0.1005,  0.1128,  0.0653, -0.0039,  0.0046,  0.0199,
         0.0335, -0.0985, -0.0393, -0.1325, -0.1135, -0.0272, -0.0191,  0.1129,
         0.0249, -0.0234, -0.0040,  0.0806, -0.0437, -0.0270, -0.0290, -0.1164,
        -0.0202, -0.1334, -0.0776, -0.0919,  0.1075, -0.1330,  0.1391,  0.0541],
       device='cuda:0', requires_grad=True) 

 Parameter containing:
tensor([[-0.0170, -0.0239,  0.0477,  ...,  0.0148, -0.0025,  0.0132],
        [ 0.0459, -0.0163, -0.0274,  ...,  0.0240,  0.0403,  0.0145],
        [-0.0264, -0.0373,  0.0041,  ..., -0.0217,  0.0381,  0.0198],
        ...,
        [ 0.0131, -0.0127,  0.0433,  ..., -0.0061,  0.0056, -0.0072],
        [-0.0119,  0.0015,  0.0027,  ...,  0.0111, -0.0128,  0.0144],
        [ 0.0034,  0.0338, -0.0243,  ..., -0.0028, -0.0256,  0.0207]],
       device='cuda:0', requires_grad=True) 

 Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)

They also both had requires_grad=True, so in my case it ends up using the cached x. I hope that the different index doesn't affect my training.
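
In case it is useful to anyone, here is a sketch of how one might log those values without editing apex in place: wrap.py (per the traceback near the top) calls utils.cached_cast through the module namespace, so reassigning that attribute should be enough, although whether this covers every code path is not guaranteed.

import torch
from apex.amp import utils

# Hypothetical debugging shim (not part of apex): print whether the cached tensor's
# recorded parent is the original tensor, then delegate to the real cached_cast.
_orig_cached_cast = utils.cached_cast

def _debug_cached_cast(cast_fn, x, cache):
    if torch.is_tensor(x) and x in cache:
        cached_x = cache[x]
        fn = cached_x.grad_fn
        node = fn.next_functions[0][0] if fn is not None and fn.next_functions else None
        parent = getattr(node, 'variable', None)
        print('cache hit | parent is x:', parent is x,
              '| requires_grad:', x.requires_grad, cached_x.requires_grad)
    return _orig_cached_cast(cast_fn, x, cache)

utils.cached_cast = _debug_cached_cast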

Giles-Billenness avatar Apr 01 '22 22:04 Giles-Billenness

Be careful adding the fix that @zwithz mentioned. I'm pretty sure it messed up mixed-precision training for me. After removing the fix months later, everything is back to normal.

Then, how did you solve that problem?

I changed this line of code

  • if cached_x.grad_fn.next_functions[0][0].variable is not x: into
  • if cached_x.grad_fn.next_functions[1][0].variable is not x: and it ran successfully (⊙o⊙)…

Rocky1salady-killer avatar Jun 21 '22 11:06 Rocky1salady-killer

I got the same error when I tried to use one BERT model to embed two sentences. The model always crashes with "AttributeError: 'NoneType' object has no attribute 'next_functions'" no matter what the second sentence is. Strangely, I can run your simple repro successfully. To explore the details, I debugged my code and found that it does not enter if is_nested(x): or if x in cache: in apex.amp.utils.cached_cast(cast_fn, x, cache) while processing the first sentence, and the cache parameter keeps growing. However, as soon as it starts on the second sentence, it does enter if x in cache: and goes wrong there.
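
For context, the flow is roughly the following simplified sketch; the Hugging Face model and tokenizer names and the exact call pattern are stand-ins rather than the real code, and whether this stripped-down version reproduces the crash may depend on details not shown here.

import torch
from apex import amp
from transformers import BertModel, BertTokenizer

device = torch.device('cuda')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

for sentence in ['first sentence', 'second sentence']:
    inputs = tokenizer(sentence, return_tensors='pt').to(device)
    # in my real code the AttributeError above is raised on the second sentence
    embedding = model(**inputs).last_hidden_state.mean(dim=1)
    print(embedding.shape)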

bigbrother001 avatar Apr 16 '23 03:04 bigbrother001

We switched to the O2 opt level, which does not have this issue. Also, mixed precision training is natively supported in PyTorch since 1.6 (torch.cuda.amp), which solves the issue as well.

Thanks, it is useful

bigbrother001 avatar Apr 16 '23 07:04 bigbrother001