denoising-diffusion-pytorch
getting NaN loss
I tried two datasets that consist of 10k and 30k images respectively. Can you please tell me if I need to change any hyperparameters?
Which dataset are you running on? I have similar issues on CIFAR-10; I am getting inf loss.
Hi Ankan and Dushyant! Do you both want to try v0.26.0 and see if it fixes the problem? https://github.com/lucidrains/denoising-diffusion-pytorch/commit/555566c1885cf554f3d8d2cb74539031f71444e4
I was also getting NaN loss after a few thousand steps on versions 0.25.3 and 0.26.1. This was happening with both GaussianDiffusion and ElucidatedDiffusion. For me the solution was setting amp = False, since I had enabled it earlier. Hope this helps.
@trufty It works for me. Thank you!!
12f95b33d8a2f44ffb48ec1079083ec634c4728f seems to have fixed this issue for me. I can set amp = True again without NaN.
Unfortunately I'm still getting this error (with amp=True) on v0.27.4.
Which dataset are you running on? I have similar issues on CelebA_align with learn_lr=1e-5; I am getting inf loss on v0.27.4.
It's a custom dataset somewhat similar to the Oxford flowers dataset. I was successfully using v0.26.4 with amp = True. I haven't tried the latest version yet. I'm using the default LR.
I'm using a custom astronomical dataset. I may have set the lr too high (it was at 1e-4), but in any event this worked with amp=False.
Also, I tried to load the state dict for the model + optimizer from before everything went NaN, but I wasn't able to train again -- the loss was still stuck at NaN.
I just tested 0.27.4 and I also get NaN loss again with amp = True within the first 2k steps.
So I rolled back to 0.26.4 and I'm already at 4k steps with no NaN loss.
Anyone else want to try 0.26.4 to confirm it's working at that version?
I am getting NaN losses with the learned positional embeddings; when I turn them off, the loss is fine.
It seems using half precision is the issue; when replacing fp16 with bf16, the model trains fine for me.
I was getting NaN for a custom dataset with the amp = True flag as well (version 0.27.9). So I changed the flag to False, and training is going well.