RuntimeError: Function 'MmBackward' returned nan values in its 0th output.
I use
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
and my loss backwards is handled by
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
Things work well without amp.initialize, so I guess the problem is triggered by the mixed precision... Can anyone give me a hint for solving this problem? Thanks!
Hi @JizeCao,
could you check your loss and see if it's a valid value or Inf/NaN? If it's invalid, could you run the code again with anomaly detection enabled? This should point to the forward method that created the invalid values.
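For readers following along, anomaly detection is enabled with `torch.autograd.set_detect_anomaly(True)`: autograd then checks every backward op for NaN and, when one appears, reports the forward-pass stack trace that produced it. A minimal self-contained sketch (the `sqrt` of a negative number is just a stand-in for whatever op creates the NaN in a real model):

```python
import torch

# With anomaly detection on, autograd records the forward trace of each op
# and raises a RuntimeError as soon as a backward op returns NaN.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)          # sqrt of a negative number -> NaN in the forward pass

caught = None
try:
    y.backward()           # raises: "Function 'SqrtBackward...' returned nan values ..."
except RuntimeError as err:
    caught = err

print(caught)
```

The error message names the backward function, and the accompanying warning prints the forward call stack, which is exactly the trace shown in the next post.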
The loss is a valid value. The anomaly detection shows the backtrace is
queries = self._query_projection(queries)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
  File "/home/caojize/anaconda3/envs/r2c/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
This computation is used for multi-head attention over query, key and value.
Thanks for the information!
Could you check which queries tensor creates this issue?
Based on the stack trace, I would guess you hit an overflow in input.matmul(weight.t()), so the weight parameter of the linear layer in _query_projection would also be interesting to see.
What do you mean by "which queries tensor creates this issue?" It seems like the output of _query_projection doesn't have Inf/NaN values. I checked the weight of that layer: the weight is float32, whereas the input variable queries is float16. Not sure whether this mismatch creates the issue...
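A note for readers on why an fp16 matmul can overflow even when the fp32 weight looks fine: float16 has a maximum finite value of 65504, so any intermediate product or activation beyond that becomes Inf, and Inf combined with itself or with 0 later in the network turns into NaN. A tiny demonstration:

```python
import torch

# float16 saturates at 65504: anything larger becomes inf, and inf can
# then propagate into NaN through operations like inf - inf or inf * 0.
print(torch.finfo(torch.float16).max)   # 65504.0

a = torch.tensor([300.0], dtype=torch.float16)
b = a * a                               # 300 * 300 = 90000 > 65504 -> inf
print(b)                                # tensor([inf], dtype=torch.float16)
print(b - b)                            # inf - inf -> nan
```

This is why loss scaling (and capping the scale, as discussed below in the thread) matters: it keeps fp16 values inside the representable range.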
@ptrblck , I observed similar issues when training embeddings on a classification task with a large number of classes. I tried both optimization levels O1 and O2. For O1 I was able to solve the issue by setting the max_loss_scale parameter of apex.amp.initialize to 2^13, but that doesn't help with O2 in my task, where NaN gradients still occasionally occur in the backward pass. I prepared a code sample to reproduce the issue: it runs without NaNs in backward only when the amp_max_loss_scale parameter on line 14 of ampO2.py is set no greater than 2^3; otherwise the code fails. I'm using PyTorch 1.2.
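For context on what max_loss_scale actually limits, here is a minimal sketch of dynamic loss scaling written in plain PyTorch (this is an illustration of the general technique, not apex's actual implementation; the variable names and the 2^16 starting scale are assumptions). The loss is multiplied by a scale before backward so small fp16 gradients don't flush to zero; if any gradient overflows, the step is skipped and the scale is reduced. Capping the scale, as max_loss_scale does, keeps the scaled gradients from overflowing in the first place:

```python
import torch

# Cap on the loss scale, analogous to amp.initialize(..., max_loss_scale=2**13).
max_loss_scale = 2.0 ** 13
scale = min(2.0 ** 16, max_loss_scale)   # dynamic scale, clamped to the cap

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

opt.zero_grad()
(loss * scale).backward()                # backward on the scaled loss

# Check the (scaled) gradients for Inf/NaN before touching the weights.
overflow = any(not torch.isfinite(p.grad).all() for p in model.parameters())
if overflow:
    scale /= 2.0                         # skip this step and back off the scale
else:
    for p in model.parameters():
        p.grad /= scale                  # unscale before the optimizer step
    opt.step()

print("scale:", scale, "overflow:", overflow)
```

In fp32, as here, the scaled gradients stay finite and the step is taken; in fp16 training a too-large scale trips the overflow branch every iteration, which matches the behavior described above where lowering the cap makes the NaNs disappear.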
@SergeyMilyaev , I'm running into a similar issue. When I try to run your example I get: TypeError: initialize() got an unexpected keyword argument 'max_loss_scale'. I downloaded apex a few days ago; do you know if something might have changed since you posted? Thank you
@zlenyk , as I see in the current documentation and code, max_loss_scale should be a valid option.
Hi @JizeCao, have you solved this issue? I ran into the same problem.