stylegan2-pytorch
nan loss
Hi, thanks for sharing your code. I used it to train on the FFHQ dataset at a resolution of 128x128. All losses become NaN after a few training iterations; the logs show them shooting to infinity almost immediately. Do you have any suggestions?
Could you let me know your batch sizes? The total batch size is --batch * (number of GPUs).
My batch size is 4 and I train on a single GPU. Even on the CelebA dataset, using your original code to create the lmdb data and then train the model, the loss becomes NaN after a few iterations.
I think batch size 4 is quite small for stable training, especially with path length regularization. You may need to adjust the learning rate and disable path length regularization (you can simply increase --g_reg_every to a number larger than the total number of training iterations).
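For reference, a minimal sketch of how lazy regularization gating usually works in this kind of training loop (the names and the stand-in model below are assumptions, not code from this repo): raising the interval above the total iteration count means the penalty branch never runs.

```python
import torch
import torch.nn as nn

# Minimal sketch of lazy-regularization gating (g_reg_every, the stand-in model
# and optimizer are assumptions, not code from this repo). Raising g_reg_every
# above the total iteration count means the penalty branch never executes, which
# effectively disables path length regularization.
g_reg_every = 1_000_000_000   # larger than any realistic number of iterations
total_iters = 100

gen = nn.Linear(8, 8)                              # stand-in for the generator
opt = torch.optim.Adam(gen.parameters(), lr=2e-4)

for step in range(1, total_iters + 1):
    z = torch.randn(4, 8)
    loss = gen(z).mean()                           # stand-in for the main G loss
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % g_reg_every == 0:
        # the path length penalty would be computed and applied here;
        # with the huge interval above, this branch is never taken
        pass
```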
I increased --batch from 4 to 8, decreased --lr from 0.002 to 0.0002, and set --g_reg_every to be larger than the total number of training iterations, but the loss still becomes NaN after a few iterations. I also found that even if I disable both path length regularization and the gradient penalty and use only the generator and discriminator losses, the loss still becomes NaN.
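In case it helps narrow things down, here is a small debugging aid (plain PyTorch, not from the repo) that turns on anomaly detection and stops at the first non-finite loss, so you can see which operation produces the NaN:

```python
import torch

# Hedged debugging aid (not part of the repo): anomaly detection makes the
# backward pass report the op that first produced a NaN/Inf, and the check below
# stops training as soon as a loss stops being finite instead of running past it.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} became non-finite: {tensor}")

# inside the training loop, right after the losses are computed:
# check_finite("d_loss", d_loss)
# check_finite("g_loss", g_loss)
```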
I got reasonable results with batch size 4, but I get NaN whenever I use an initialization of the network parameters other than the PyTorch default.
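To illustrate why the init can matter, here is a toy sketch (the sizes and std values are made up, nothing here is from the repo) showing how an over-scaled custom init inflates activations layer by layer compared to the PyTorch default:

```python
import torch
import torch.nn as nn

# Toy illustration (not from the repo): an over-scaled custom init makes the
# activations grow layer by layer, which is one way a non-default init can push
# the losses toward Inf/NaN at an otherwise fine learning rate.
def make_net(init_std=None):
    layers = []
    for _ in range(8):
        lin = nn.Linear(512, 512)
        if init_std is not None:
            nn.init.normal_(lin.weight, std=init_std)  # custom init
            nn.init.zeros_(lin.bias)
        layers += [lin, nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

x = torch.randn(4, 512)
print(make_net()(x).std())              # PyTorch default init: output stays small
print(make_net(init_std=0.5)(x).std())  # over-scaled init: output explodes
```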
I am facing the same issue. Essentially, the problem appears when I try to train in a distributed manner. I have tried every adjustment of the regularization parameters, batch sizes, and learning rates, but it still fails. Training on a single GPU, or even two, works; the more I scale the training up, the more easily it fails. A small helper I use to see which rank breaks first is sketched below.
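This is only a sketch and assumes the default process group has already been initialized by the distributed launcher; the function name is my own, not part of the repo:

```python
import torch
import torch.distributed as dist

# Hedged helper (assumes the default process group is already initialized by the
# distributed launcher): gather a per-rank "loss is finite" flag every step so
# rank 0 can report which GPU/rank breaks first as training is scaled up.
def report_nonfinite(loss, step):
    flag = torch.tensor([float(torch.isfinite(loss).all())], device=loss.device)
    flags = [torch.zeros_like(flag) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, flag)
    if dist.get_rank() == 0:
        bad = [rank for rank, f in enumerate(flags) if f.item() == 0.0]
        if bad:
            print(f"step {step}: non-finite loss on rank(s) {bad}")
```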
I think the NaN issue might come from the softplus loss. I do not know why, but the backward hook fails on softplus.
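A hedged sketch of how to check this with plain PyTorch hooks (not repo code): register a gradient hook on the discriminator logits so a non-finite gradient coming back through the softplus-based logistic loss is flagged where it appears.

```python
import torch
import torch.nn.functional as F

# Hedged sketch (plain PyTorch hooks, not repo code): flag a NaN/Inf gradient
# flowing back through the softplus-based logistic loss at the point it appears.
def nan_guard(name):
    def hook(grad):
        if not torch.isfinite(grad).all():
            raise RuntimeError(f"non-finite gradient flowing into {name}")
        return grad
    return hook

real_pred = torch.randn(8, 1, requires_grad=True)  # stand-in for D(real)
fake_pred = torch.randn(8, 1, requires_grad=True)  # stand-in for D(fake)
real_pred.register_hook(nan_guard("real_pred"))
fake_pred.register_hook(nan_guard("fake_pred"))

# logistic discriminator loss as used in StyleGAN2
d_loss = F.softplus(-real_pred).mean() + F.softplus(fake_pred).mean()
d_loss.backward()
```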
Update on my issue: there was a problem with one of my GPUs in my multi-node, multi-GPU setup. Some gate must have been broken.
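In case it helps anyone else, a rough per-device sanity check (not part of the repo) that can spot a misbehaving GPU: run the same deterministic matmul on every visible device and compare against the CPU result.

```python
import torch

# Rough per-device sanity check (not part of the repo): run the same
# deterministic matmul on every visible GPU and compare against the CPU result;
# a faulty device tends to show up as NaNs or large mismatches here.
torch.backends.cuda.matmul.allow_tf32 = False  # keep results comparable to CPU
torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
ref = a @ b

for idx in range(torch.cuda.device_count()):
    dev = f"cuda:{idx}"
    out = (a.to(dev) @ b.to(dev)).cpu()
    bad = (not torch.isfinite(out).all()) or bool((out - ref).abs().max() > 1e-2)
    print(f"{dev}: {'SUSPECT' if bad else 'ok'}")
```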