stylegan2-pytorch
nan loss
Hi, thanks for sharing your code. I used it to train on the FFHQ dataset at a resolution of 128x128. All losses become NaN after a few training iterations; the logs show them shooting to infinity almost immediately. Do you have any suggestions?
Could you let me know your batch sizes? The total batch size is --batch * (number of GPUs).
My batch size is 4 and I train on a single GPU. Even on the CelebA dataset, using your original code to create the lmdb data and then train the model, the loss becomes NaN after a few iterations.
I think batch size 4 is quite small for stable training, especially with path length regularization. You may need to adjust the learning rate and disable path length regularization (you can simply increase --g_reg_every to a number larger than the total number of training iterations).
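For reference, a minimal sketch of how lazy regularization gating usually works in this kind of training loop (the names and the stand-in model below are assumptions, not code from this repo): raising the interval above the total iteration count means the penalty branch never runs.

```python
import torch
import torch.nn as nn

# Minimal sketch of lazy-regularization gating (g_reg_every, the stand-in model
# and optimizer are assumptions, not code from this repo). Raising g_reg_every
# above the total iteration count means the penalty branch never executes, which
# effectively disables path length regularization.
g_reg_every = 1_000_000_000   # larger than any realistic number of iterations
total_iters = 100

gen = nn.Linear(8, 8)                              # stand-in for the generator
opt = torch.optim.Adam(gen.parameters(), lr=2e-4)

for step in range(1, total_iters + 1):
    z = torch.randn(4, 8)
    loss = gen(z).mean()                           # stand-in for the main G loss
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % g_reg_every == 0:
        # the path length penalty would be computed and applied here;
        # with the huge interval above, this branch is never taken
        pass
```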
I increased --batch from 4 to 8, decreased --lr from 0.002 to 0.0002, and set --g_reg_every to be larger than the total number of training iterations, but the loss still becomes NaN after a few iterations. I also found that even if I disable both path length regularization and the gradient penalty and use only the generator and discriminator losses, the loss still becomes NaN.
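In case it helps narrow things down, here is a small debugging aid (plain PyTorch, not from the repo) that turns on anomaly detection and stops at the first non-finite loss, so you can see which operation produces the NaN:

```python
import torch

# Hedged debugging aid (not part of the repo): anomaly detection makes the
# backward pass report the op that first produced a NaN/Inf, and the check below
# stops training as soon as a loss stops being finite instead of running past it.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} became non-finite: {tensor}")

# inside the training loop, right after the losses are computed:
# check_finite("d_loss", d_loss)
# check_finite("g_loss", g_loss)
```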
I got reasonable results with batch size 4, but I get NaN whenever I use an initialization of the network parameters other than the PyTorch default.
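To illustrate why the init can matter, here is a toy sketch (the sizes and std values are made up, nothing here is from the repo) showing how an over-scaled custom init inflates activations layer by layer compared to the PyTorch default:

```python
import torch
import torch.nn as nn

# Toy illustration (not from the repo): an over-scaled custom init makes the
# activations grow layer by layer, which is one way a non-default init can push
# the losses toward Inf/NaN at an otherwise fine learning rate.
def make_net(init_std=None):
    layers = []
    for _ in range(8):
        lin = nn.Linear(512, 512)
        if init_std is not None:
            nn.init.normal_(lin.weight, std=init_std)  # custom init
            nn.init.zeros_(lin.bias)
        layers += [lin, nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

x = torch.randn(4, 512)
print(make_net()(x).std())              # PyTorch default init: output stays small
print(make_net(init_std=0.5)(x).std())  # over-scaled init: output explodes
```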
I am facing the same issue. Essentially, the problem appears when I try to train in a distributed manner. I have tried every adjustment of the regularization parameters, batch sizes, and learning rates, but it still fails. Training on a single GPU, or even two, works; the more I scale the training up, the more easily it fails. A small helper I use to see which rank breaks first is sketched below.
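This is only a sketch and assumes the default process group has already been initialized by the distributed launcher; the function name is my own, not part of the repo:

```python
import torch
import torch.distributed as dist

# Hedged helper (assumes the default process group is already initialized by the
# distributed launcher): gather a per-rank "loss is finite" flag every step so
# rank 0 can report which GPU/rank breaks first as training is scaled up.
def report_nonfinite(loss, step):
    flag = torch.tensor([float(torch.isfinite(loss).all())], device=loss.device)
    flags = [torch.zeros_like(flag) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, flag)
    if dist.get_rank() == 0:
        bad = [rank for rank, f in enumerate(flags) if f.item() == 0.0]
        if bad:
            print(f"step {step}: non-finite loss on rank(s) {bad}")
```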
I think the NaN issue might come from the softplus loss. I do not know why, but the backward hook fails on softplus.
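A hedged sketch of how to check this with plain PyTorch hooks (not repo code): register a gradient hook on the discriminator logits so a non-finite gradient coming back through the softplus-based logistic loss is flagged where it appears.

```python
import torch
import torch.nn.functional as F

# Hedged sketch (plain PyTorch hooks, not repo code): flag a NaN/Inf gradient
# flowing back through the softplus-based logistic loss at the point it appears.
def nan_guard(name):
    def hook(grad):
        if not torch.isfinite(grad).all():
            raise RuntimeError(f"non-finite gradient flowing into {name}")
        return grad
    return hook

real_pred = torch.randn(8, 1, requires_grad=True)  # stand-in for D(real)
fake_pred = torch.randn(8, 1, requires_grad=True)  # stand-in for D(fake)
real_pred.register_hook(nan_guard("real_pred"))
fake_pred.register_hook(nan_guard("fake_pred"))

# logistic discriminator loss as used in StyleGAN2
d_loss = F.softplus(-real_pred).mean() + F.softplus(fake_pred).mean()
d_loss.backward()
```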
Update on my issue: there was a problem with one of my GPUs in my multi-node, multi-GPU setup. Some gate must have been broken.
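In case it helps anyone else, a rough per-device sanity check (not part of the repo) that can spot a misbehaving GPU: run the same deterministic matmul on every visible device and compare against the CPU result.

```python
import torch

# Rough per-device sanity check (not part of the repo): run the same
# deterministic matmul on every visible GPU and compare against the CPU result;
# a faulty device tends to show up as NaNs or large mismatches here.
torch.backends.cuda.matmul.allow_tf32 = False  # keep results comparable to CPU
torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
ref = a @ b

for idx in range(torch.cuda.device_count()):
    dev = f"cuda:{idx}"
    out = (a.to(dev) @ b.to(dev)).cpu()
    bad = (not torch.isfinite(out).all()) or bool((out - ref).abs().max() > 1e-2)
    print(f"{dev}: {'SUSPECT' if bad else 'ok'}")
```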