
manual_backward + fp16 training doesn't converge

Open DrJimFan opened this issue 3 years ago • 1 comment

Hi, I borrowed some snippets from your codebase for the distributed-GPU and minibatch-within-batch training in my own project. However, I found that training with manual_backward() + FP16 does not converge at all. If I switch to FP32, training works with no other code changes. I'm using the latest pytorch-lightning, v1.6.3. Have you observed similar issues?
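For context, the setup being described looks roughly like the sketch below: a LightningModule with manual optimization, trained with `Trainer(precision=16)`. This is a minimal illustration of the pattern, not the issue author's actual code; the model, data, and hyperparameters are placeholders.

```python
import torch
import pytorch_lightning as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # enable manual optimization
        self.model = torch.nn.Linear(32, 2)  # placeholder model

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.model(x), y)
        # manual_backward lets Lightning apply AMP loss scaling when precision=16
        self.manual_backward(loss)
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# The reported symptom: the same module converges with precision=32
# but fails to converge with precision=16.
trainer = pl.Trainer(precision=16, accelerator="gpu", devices=1, max_epochs=1)
```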

DrJimFan · May 09 '22 07:05

I saw something similar, fwiw -- exploding gradients in the gradient-scaling step from the very first forward pass. I read in other threads online that this is fairly common with transformer architectures, especially ones whose parameters include values smaller in magnitude than the smallest normal 16-bit float (~6.1e-5), which is apparently not unusual.
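One way to check whether a model is at risk of that (a hypothetical diagnostic, not something from this thread) is to count parameter values whose magnitude falls below float16's smallest normal number, which PyTorch exposes as `torch.finfo(torch.float16).tiny`:

```python
import torch


def report_fp16_underflow_risk(model: torch.nn.Module) -> None:
    """Count parameter values smaller in magnitude than fp16's smallest normal number."""
    tiny = torch.finfo(torch.float16).tiny  # ~6.1e-05
    for name, param in model.named_parameters():
        p = param.detach().abs()
        n_small = ((p > 0) & (p < tiny)).sum().item()
        if n_small:
            print(f"{name}: {n_small}/{p.numel()} values below {tiny:.2e} "
                  f"(subnormal or zero when cast to float16)")
```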

RZachLamberty · Sep 23 '22 18:09