
RuntimeError: Function 'MmBackward' returned nan values in its 0th output.

Open JizeCao opened this issue 6 years ago • 8 comments

I use

```python
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
```

and my loss backward is handled by

```python
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```

Things work well without `amp.initialize`, so I guess the problem is triggered by the mixed precision... Can anyone give me a hint for solving this problem? Thanks!

JizeCao avatar Aug 16 '19 22:08 JizeCao

Hi @JizeCao,

could you check your loss and see if it's a valid value or Inf/NaN? If it's invalid, could you run the code again with anomaly detection enabled? It should point to the forward method that created the invalid values.
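
For reference, anomaly detection can be enabled like this (a minimal self-contained sketch; the `sqrt(-1)` tensor is just a demo trigger, not from your model):

```python
import torch

# Enable autograd anomaly detection: any backward op that returns NaN raises
# immediately, and the error message carries a traceback of the forward op
# that created the offending values.
torch.autograd.set_detect_anomaly(True)

# Tiny demo of the failure mode: sqrt(-1) is NaN in the forward pass, so its
# backward also returns NaN and anomaly detection flags it.
x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)
try:
    y.backward()
except RuntimeError as e:
    print(e)  # e.g. "Function 'SqrtBackward0' returned nan values in its 0th output."
```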

ptrblck avatar Aug 16 '19 22:08 ptrblck

The loss is a valid value. The anomaly detection shows the backtrace is:

```
queries = self._query_projection(queries)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
```

This computation is used for multi-head attention over the query, key, and value.

JizeCao avatar Aug 16 '19 22:08 JizeCao

Thanks for the information! Could you check which `queries` tensor creates this issue? Based on the stack trace I would guess you encounter an overflow in `input.matmul(weight.t())`, so the `weight` parameter of the linear layer in `_query_projection` would also be interesting to see.
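
For illustration, float16 overflows much earlier than float32; its largest finite value is 65504, so even moderately large activations can overflow inside a matmul. A tiny sketch, unrelated to the actual model in this issue:

```python
import torch

# float16 can only represent values up to 65504; anything larger overflows.
print(torch.finfo(torch.float16).max)  # 65504.0

# A modest product already overflows to inf in half precision...
x = torch.tensor([300.0], dtype=torch.float16)
print(x * x)  # tensor([inf], dtype=torch.float16)

# ...while the same computation is fine in float32.
print(x.float() * x.float())  # tensor([90000.])
```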

ptrblck avatar Aug 16 '19 23:08 ptrblck

What do you mean by "which `queries` tensor creates this issue"? It seems like the output of `_query_projection` doesn't have Inf/NaN values. I checked the weight of that layer: the weight is float32, whereas the input variable `queries` is float16. Not sure whether this could create the issue...
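
A quick sketch of the kind of check involved; `proj` and `queries` stand for the `_query_projection` module and its input from the trace above:

```python
import torch

def inspect_projection(proj, queries):
    """Print dtypes and flag non-finite values around a linear projection."""
    # Under O1, apex keeps parameters in fp32 and the patched F.linear casts
    # its inputs (including the weight) to fp16 at call time, so a float32
    # weight next to a float16 input is expected, not itself the bug.
    print("weight dtype:", proj.weight.dtype)
    print("input  dtype:", queries.dtype)
    out = proj(queries)
    for name, t in [("queries", queries), ("weight", proj.weight), ("output", out)]:
        if not torch.isfinite(t).all():
            print(f"{name} contains Inf/NaN")
    return out
```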

JizeCao avatar Aug 17 '19 00:08 JizeCao

@ptrblck, I observed similar issues when training embeddings on a classification task with a large number of classes. I tried both optimization levels O1 and O2. For O1 I was able to solve the issue by setting the max_loss_scale parameter of apex.amp.initialize to 2^13, but that doesn't help with O2 in my task, where NaN gradients occasionally occur in the backward pass. I prepared a code sample to reproduce this issue: it runs without NaNs in backward only when the amp_max_loss_scale parameter on line 14 of ampO2.py is set no greater than 2^3; otherwise the code fails. I'm using PyTorch 1.2.
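
For anyone landing here: max_loss_scale caps apex's dynamic loss scaler (the default is 2**24). A minimal sketch of how it is passed; the toy model and optimizer are just to make the call concrete:

```python
import torch
from apex import amp

# Hypothetical tiny model/optimizer for illustration only.
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cap the dynamic loss scale at 2**13 so the scaler can never grow past it;
# the default cap of 2**24 can push fp16 gradients to Inf/NaN.
model, optimizer = amp.initialize(
    model, optimizer, opt_level='O1', max_loss_scale=2**13
)
```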

SergeyMilyaev avatar Sep 13 '19 09:09 SergeyMilyaev

@SergeyMilyaev, I'm running into a similar issue. When I try to run your example I get: `TypeError: initialize() got an unexpected keyword argument 'max_loss_scale'`. I downloaded apex a few days ago; do you know if something might have changed since you posted? Thank you

zlenyk avatar Dec 16 '19 23:12 zlenyk

@zlenyk, as far as I can see in the current documentation and code, max_loss_scale should be a valid option.

SergeyMilyaev avatar Dec 22 '19 19:12 SergeyMilyaev

Hi @JizeCao, have you solved this issue? I encountered the same problem as well.

Tokymin avatar Feb 13 '22 04:02 Tokymin