lightweight-gan

Getting into "NaN detected for generator or discriminator" loop every time

Open · artucalvo opened this issue on Jun 15, 2021 · 2 comments

I have tried running the algorithm on Colab with different datasets (256 and 512 px), batch sizes (16, 32), aug probabilities (0.25, 0.40), and gradient_accumulate_every values (4, 2, 1). However, I always end up stuck in the NaN loop in under an hour.

Here is one example run, where GP quickly climbs to 10.00 and stays pinned there. Any thoughts on what is going on?

!lightweight_gan \
    --data $IMAGES_PATH \
    --results_dir $RESULTS_PATH \
    --models_dir $MODELS_PATH \
    --image-size 512 \
    --name LWG \
    --batch-size 32 \
    --gradient-accumulate-every 1 \
    --num-train-steps 1000000 \
    --save_every 1000 \
    --disc_output_size 5 \
    --aug-prob 0.4 \
    --aug-types [translation] \
    --amp \
    --new False
LWG</content/Training/>:   0% 0/1000000 [00:00<?, ?it/s]G: 1.87 | D: 3.59 | SS: 35.82
LWG</content/Training/>:   0% 45/1000000 [01:07<430:59:20,  1.55s/it]G: 1.09 | D: 0.42 | GP: 69.44 | SS: 0.42
LWG</content/Training/>:   0% 93/1000000 [02:30<406:26:33,  1.46s/it]G: 0.90 | D: 1.00 | GP: 12.81 | SS: 0.23
LWG</content/Training/>:   0% 149/1000000 [03:39<398:23:47,  1.43s/it]G: 1.37 | D: 1.33 | GP: 8.62 | SS: 0.25
LWG</content/Training/>:   0% 193/1000000 [04:42<399:40:24,  1.44s/it]G: -0.15 | D: 1.47 | GP: 1.17 | SS: 0.27
LWG</content/Training/>:   0% 249/1000000 [06:06<404:35:24,  1.46s/it]G: 0.55 | D: 1.59 | GP: 11.58 | SS: 0.22
LWG</content/Training/>:   0% 297/1000000 [07:14<392:55:39,  1.41s/it]G: 1.05 | D: 2.17 | GP: 3.01 | SS: 0.20
LWG</content/Training/>:   0% 345/1000000 [08:26<406:13:03,  1.46s/it]G: 0.00 | D: 1.82 | GP: 3.61 | SS: 0.32
LWG</content/Training/>:   0% 393/1000000 [09:34<396:13:20,  1.43s/it]G: 0.88 | D: 1.76 | GP: 0.54 | SS: 0.32
LWG</content/Training/>:   0% 449/1000000 [10:57<397:54:06,  1.43s/it]G: 0.56 | D: 1.48 | GP: 0.37 | SS: 0.39
LWG</content/Training/>:   0% 497/1000000 [12:06<394:28:55,  1.42s/it]G: 0.65 | D: 1.81 | GP: 0.51 | SS: 0.34
LWG</content/Training/>:   0% 545/1000000 [13:16<396:17:34,  1.43s/it]G: -0.23 | D: 1.93 | GP: 0.40 | SS: 0.34
LWG</content/Training/>:   0% 593/1000000 [14:25<391:16:11,  1.41s/it]G: -0.31 | D: 1.95 | GP: 0.22 | SS: 0.27
LWG</content/Training/>:   0% 649/1000000 [15:47<401:20:09,  1.45s/it]G: -0.26 | D: 1.94 | GP: 10.00 | SS: 0.29
LWG</content/Training/>:   0% 697/1000000 [16:54<391:38:29,  1.41s/it]G: -0.39 | D: 2.19 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 745/1000000 [18:05<397:02:57,  1.43s/it]G: -0.21 | D: 2.16 | GP: 10.00 | SS: 0.32
LWG</content/Training/>:   0% 793/1000000 [19:13<391:04:03,  1.41s/it]G: -0.03 | D: 1.65 | GP: 10.00 | SS: 0.36
LWG</content/Training/>:   0% 849/1000000 [20:36<396:24:19,  1.43s/it]G: -0.22 | D: 2.14 | GP: 10.00 | SS: 0.30
LWG</content/Training/>:   0% 897/1000000 [21:44<395:12:13,  1.42s/it]G: -0.29 | D: 2.20 | GP: 10.00 | SS: 0.28
LWG</content/Training/>:   0% 945/1000000 [22:54<393:23:01,  1.42s/it]G: -0.45 | D: 2.20 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 993/1000000 [24:02<391:36:45,  1.41s/it]G: -0.37 | D: 2.16 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 1049/1000000 [25:25<401:18:43,  1.45s/it]G: -0.37 | D: 2.17 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 1097/1000000 [26:33<394:19:55,  1.42s/it]G: -0.43 | D: 2.28 | GP: 10.00 | SS: 0.30
LWG</content/Training/>:   0% 1145/1000000 [27:45<399:24:02,  1.44s/it]G: -0.45 | D: 2.18 | GP: 10.00 | SS: 0.38
LWG</content/Training/>:   0% 1193/1000000 [28:52<390:31:48,  1.41s/it]G: -0.42 | D: 2.27 | GP: 10.00 | SS: 0.28
LWG</content/Training/>:   0% 1249/1000000 [30:15<397:12:58,  1.43s/it]G: -0.49 | D: 2.21 | GP: 10.00 | SS: 0.31
LWG</content/Training/>:   0% 1297/1000000 [31:24<395:35:03,  1.43s/it]G: -0.47 | D: 2.43 | GP: 10.00 | SS: 0.36
LWG</content/Training/>:   0% 1345/1000000 [32:35<397:08:04,  1.43s/it]G: -0.68 | D: 2.40 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 1393/1000000 [33:43<392:13:41,  1.41s/it]G: -0.73 | D: 2.44 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 1449/1000000 [35:06<400:23:47,  1.44s/it]G: -0.88 | D: 2.71 | GP: 10.00 | SS: 0.30
LWG</content/Training/>:   0% 1497/1000000 [36:13<391:56:53,  1.41s/it]G: -0.52 | D: 2.41 | GP: 10.00 | SS: 0.34
LWG</content/Training/>:   0% 1545/1000000 [37:26<403:18:44,  1.45s/it]G: -0.85 | D: 2.75 | GP: 10.00 | SS: 0.31
LWG</content/Training/>:   0% 1593/1000000 [38:33<391:32:21,  1.41s/it]G: -1.02 | D: 2.75 | GP: 10.00 | SS: 0.33
LWG</content/Training/>:   0% 1649/1000000 [39:56<394:39:26,  1.42s/it]G: -0.71 | D: 2.79 | GP: 10.00 | SS: 0.39
LWG</content/Training/>:   0% 1697/1000000 [41:05<394:23:49,  1.42s/it]G: -0.75 | D: 3.20 | GP: 10.00 | SS: 0.32
LWG</content/Training/>:   0% 1745/1000000 [42:15<395:36:40,  1.43s/it]G: -0.52 | D: 3.16 | GP: 10.00 | SS: 0.28
LWG</content/Training/>:   0% 1793/1000000 [43:24<394:27:18,  1.42s/it]G: -1.22 | D: 2.72 | GP: 10.00 | SS: 0.34
LWG</content/Training/>:   0% 1809/1000000 [43:50<415:27:46,  1.50s/it]NaN detected for generator or discriminator. Loading from checkpoint #1
loading from version 0.20.2

artucalvo · Jun 15 '21 20:06

This happens when you use the AMP flag. I saw the same thing in the StyleGAN2-PyTorch implementation when using its fp16 flag, so it seems the models collapse quite quickly after initialization under mixed precision (see the sketch after this comment for why fp16 tends to do this). Training works just fine when AMP is omitted, albeit slower and more memory-intensive.

KingOfCashews · Jul 05 '21 13:07
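For context on the AMP failure mode: below is a minimal, hypothetical sketch assuming a standard PyTorch mixed-precision setup (torch.cuda.amp.autocast plus GradScaler), not the actual lightweight-gan training loop. In float16 the largest representable value is about 65504, so a large discriminator output or penalty term can overflow to inf and later turn into NaN; GradScaler skips optimizer steps when it finds inf/NaN gradients, but a NaN produced in the forward pass still ends up in the logged losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical AMP discriminator step -- a sketch, not lightweight-gan's code.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"          # fp16 autocast is only meaningful on GPU

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1)).to(device)
opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

real = torch.randn(8, 3, 64, 64, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):
    logits = disc(real)                       # float16 under autocast
    # float16 overflows above ~65504, so a very large logit can make the
    # loss inf, and later inf - inf arithmetic produces NaN.
    loss = F.softplus(-logits).mean()

scaler.scale(loss).backward()
scaler.step(opt)        # silently skipped when inf/NaN gradients are detected
scaler.update()

if not torch.isfinite(loss):
    print("non-finite loss already in the forward pass")

This matches the observation above: dropping AMP keeps everything in float32 and avoids the loop, at the cost of speed and memory.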

I get the same problem without AMP; it then always tries to load from checkpoint 0 (I think this may be a logging error, though see the sketch after this comment for another possible explanation). The GAN in general seems to be highly unstable.

ckyleda · Aug 02 '21 12:08
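One possible reading of the "always loads checkpoint 0" behaviour, sketched below under assumptions (the trainer interface shown is hypothetical, not lightweight-gan's actual API): if checkpoints are only written every --save_every steps (1000 in the run above) and the losses go NaN before the next save, the newest checkpoint on disk is still an early one, so the guard keeps reloading #0 or #1 without this necessarily being a logging error.

import math

# Hypothetical NaN guard / rollback loop -- a sketch of the general pattern,
# not the lightweight-gan implementation. `trainer` is assumed to expose
# .step(), .save(n), .load(n) and .last_losses (a dict of floats).
def train_with_nan_guard(trainer, num_train_steps, save_every=1000):
    checkpoint_num = 0
    trainer.save(checkpoint_num)                  # checkpoint #0 at step 0
    for step in range(1, num_train_steps + 1):
        trainer.step()
        losses = trainer.last_losses              # e.g. {"G": ..., "D": ...}

        if any(math.isnan(v) for v in losses.values()):
            # Roll back to the newest checkpoint actually written. If the run
            # NaNs out before `save_every` steps, that is always checkpoint #0.
            print(f"NaN detected for generator or discriminator. "
                  f"Loading from checkpoint #{checkpoint_num}")
            trainer.load(checkpoint_num)
            continue

        if step % save_every == 0:
            checkpoint_num = step // save_every
            trainer.save(checkpoint_num)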