lion-pytorch
Always getting NaNs in long training
I've been experimenting with the LION optimizer in your other (great) Imagen repository. I can share my anecdotal experience and combinations:
- Models of different sizes: 0.2B, 0.7B, and 1B params.
- Betas such as `beta1 = 0.95` and `beta2 = 0.98`.
- Learning rates of `1e-4`, `3e-5`, and `1e-5`.
- Triton kernel turned both `True` and `False`.
Training was indeed fast, but unfortunately it always ended up yielding NaNs.
I think a potential issue could be how LION interacts with a warmup schedule; I am not sure if you're supposed to do warmup with this optimizer or not (which I always did).
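For context, here is a minimal sketch of the kind of setup described above, assuming the `Lion` constructor arguments documented in this repo's README (`lr`, `betas`, `weight_decay`, `use_triton`); the warmup length, weight decay value, and dummy model are placeholders, not values from the report.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)   # stand-in for the actual network

optimizer = Lion(
    model.parameters(),
    lr = 1e-4,                      # one of the learning rates tried above
    betas = (0.95, 0.98),           # the betas tried above
    weight_decay = 1e-2,            # placeholder value
    use_triton = False,             # the Triton kernel was tried both ways
)

warmup_steps = 1000                 # placeholder warmup length

# Linear warmup from 0 to the base lr, then constant afterwards.
scheduler = LambdaLR(optimizer, lr_lambda = lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(10_000):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```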
I have the same problem :(
Same NaN issue with a CosineAnnealing scheduler after the first epoch.
May I know the learning rate schedule you are using?
Same issue; I set a large weight decay to avoid it. I suspect that the update `update = sign * lr` keeps enlarging `abs(parameter)` as long as the sign doesn't change.
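To make that argument concrete, here is a toy sketch of a sign-style update with decoupled weight decay on a single scalar. This is just the update rule written out, not the repo's implementation, and the numbers are illustrative only.

```python
# A Lion-style sign update with a constant sign grows |parameter| linearly
# unless decoupled weight decay pulls it back toward a fixed point.
lr = 1e-4
wd_values = [0.0, 1.0]        # no decay vs. a large decoupled decay

for wd in wd_values:
    p = 1.0                   # a single scalar "parameter"
    sign = 1.0                # assume the sign of the update never flips
    for _ in range(100_000):
        p = p * (1 - lr * wd) # decoupled weight decay, as in AdamW/Lion
        p = p + lr * sign     # sign update: magnitude is always lr
    print(f"wd={wd}: p={p:.3f}")

# With wd=0 the parameter drifts from 1.0 to ~11; with the large decay it
# settles near the fixed point (lr * sign) / (lr * wd) = 1 / wd = 1.0.
```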
Same here. Sudden NaN losses during 100-epoch training with OneCycleLR and gradient clipping.
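Not a fix, but for reference a sketch of the OneCycleLR + clipping setup being described, with an added guard that skips steps whose loss is already non-finite. The `max_norm`, step count, and dummy model are assumptions, not values from the report above.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)   # stand-in for the actual network
optimizer = Lion(model.parameters(), lr = 3e-5, betas = (0.95, 0.98))

total_steps = 10_000                # placeholder step count
scheduler = OneCycleLR(optimizer, max_lr = 3e-5, total_steps = total_steps)

for step in range(total_steps):
    loss = model(torch.randn(8, 512)).pow(2).mean()

    # Skip the step entirely if the loss has already gone non-finite,
    # so a single bad batch does not poison the optimizer state.
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```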