
getting NaN loss

Open ankanbhunia opened this issue 1 year ago • 12 comments

I tried two datasets consisting of 10k and 30k images respectively, and I am getting NaN loss. Can you please tell me if I need to change any hyperparameters?
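
For reference, a minimal training setup along the lines of the repository README looks roughly like the sketch below; argument names such as `amp` and `loss_type` may differ between versions, and the values shown are only illustrative:

```python
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of diffusion steps
    loss_type = 'l1'    # L1 or L2 loss
)

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 1e-4,               # the learning rate, one of the hyperparameters discussed below
    train_num_steps = 700000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True                     # mixed precision; the flag at the center of this thread
)

trainer.train()
```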

ankanbhunia avatar Jul 12 '22 03:07 ankanbhunia

Which dataset are you running on? I have similar issues on CIFAR-10; I am getting inf loss.

DushyantSahoo avatar Jul 13 '22 16:07 DushyantSahoo

Hi Ankan and Dushyant! Do you both want to try v0.26.0 and see if it fixes the problem? https://github.com/lucidrains/denoising-diffusion-pytorch/commit/555566c1885cf554f3d8d2cb74539031f71444e4

lucidrains avatar Jul 18 '22 18:07 lucidrains

I was also getting NaN loss after a few thousand steps on versions 0.25.3 AND 0.26.1. This was happening with both GaussianDiffusion and ElucidatedDiffusion. For me the solution was setting amp = False, since I had enabled it earlier. Hope this helps!
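
For illustration, this workaround amounts to passing `amp = False` when constructing the Trainer; a sketch assuming the setup shown earlier in the thread (the keyword may be named differently in other versions):

```python
trainer = Trainer(
    diffusion,                 # the GaussianDiffusion / ElucidatedDiffusion wrapper from above
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 700000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = False                # disable automatic mixed precision to avoid fp16 overflow to NaN
)

trainer.train()
```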

trufty avatar Jul 20 '22 23:07 trufty

@trufty It works for me. Thank you!!

inooni avatar Jul 21 '22 04:07 inooni

12f95b33d8a2f44ffb48ec1079083ec634c4728f seems to have fixed this issue for me. I can set amp = True again without NaN.

trufty avatar Jul 28 '22 02:07 trufty

Unfortunately I'm still getting this error (with amp=True) on v0.27.4.

jwuphysics avatar Aug 17 '22 19:08 jwuphysics

Which dataset are you running on? I have similar issues on CelebA_align with learn_lr=1e-5; I am getting inf loss on v0.27.4.

> 12f95b3 seems to have fixed this issue for me. I can set amp = True again without NaN.

KANGXI123456 avatar Aug 18 '22 11:08 KANGXI123456

It's a custom dataset, somewhat similar to the Oxford flowers dataset. I was successfully using v0.26.4 with amp = True; I haven't tried the latest version yet. I'm using the default LR.

trufty avatar Aug 18 '22 13:08 trufty

I'm using a custom astronomical dataset. I may have set the lr too high (it was 1e-4), but in any event training worked with amp=False.

Also, I tried loading the state dict for the model + optimizer from before everything went NaN, but I wasn't able to train again: the loss was still stuck at NaN.
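
For context, such a resume attempt using the Trainer's milestone checkpoints might look roughly like the sketch below; the `load()` signature, the milestone numbering, and the lowered learning rate are assumptions and may differ by version (and, as noted above, this did not recover training in this particular case):

```python
# Hypothetical resume sketch: rebuild the trainer with amp disabled and a lower
# learning rate, then reload an earlier milestone saved before the loss went NaN.
trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_lr = 1e-5,   # lower than the 1e-4 that may have been too high
    amp = False
)
trainer.load(5)        # milestone index of a checkpoint from before the NaN appeared
trainer.train()
```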

jwuphysics avatar Aug 18 '22 13:08 jwuphysics

I just tested 0.27.4 and I also get NaN loss again with amp = True within the first 2k steps. So I rolled back to 0.26.4 and I'm already at 4k steps with no NaN loss.

Anyone else want to try 0.26.4 to confirm it's working at that version?

trufty avatar Aug 18 '22 14:08 trufty

I am getting NaN losses with the learned positional embeddings; when I turn them off, the loss is fine.

It seems half precision is the issue: when I replace fp16 with bf16, the model trains fine for me.
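
For anyone wanting to try the same swap, here is a minimal generic PyTorch sketch; bf16 was not a flag the library exposed at the time, so it would mean editing the training step yourself, and `training_step`, `diffusion_model`, and `images` are placeholder names:

```python
import torch

def training_step(diffusion_model, images, optimizer):
    optimizer.zero_grad()
    # bfloat16 keeps the fp32 exponent range, so it is far less prone to
    # overflowing to inf/NaN than fp16, at the cost of some mantissa precision
    with torch.autocast(device_type = 'cuda', dtype = torch.bfloat16):
        loss = diffusion_model(images)   # GaussianDiffusion's forward returns the loss
    loss.backward()                      # no GradScaler needed with bf16
    optimizer.step()
    return loss.item()
```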

p-sodmann avatar Aug 22 '22 11:08 p-sodmann

I was also getting NaN for a custom dataset with the amp = True flag (version 0.27.9). So I changed the flag to False and the training is going well.
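
As a general debugging aid (not part of the library), one can also bail out as soon as the loss stops being finite rather than discovering the NaN thousands of steps later; a minimal sketch, where the hypothetical `check_finite` helper would be called on the loss inside a custom training loop:

```python
import math
import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    """Raise early if the training loss has gone inf/NaN, so the last good
    checkpoint is not overwritten with corrupted weights."""
    value = loss.item()
    if not math.isfinite(value):
        raise RuntimeError(
            f'non-finite loss ({value}) at step {step}; '
            'try amp = False, bf16, or a lower learning rate'
        )
```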

heitorrapela avatar Sep 16 '22 23:09 heitorrapela