esrgan-tf2

Loss is always nan

Open taoyu17 opened this issue 3 years ago • 1 comments

Hello PeteryuX,

Thanks a lot for sharing your implementation of ESRGAN.

I have been testing some GAN-based super-resolution networks recently. I have a lot of HR/LR training images and would like to train the ESRGAN (PSNR+ESRGAN) network using your training code.

I followed your instructions on data preparation and converted my 1,825,587 pairs of LR/HR samples to *bin.tfrecord. I checked the result with dataset_checker and found no problems (the LR/HR images displayed well), modified a few lines of your code for the hardcoded paths etc., and started PSNR training on an RTX 3090 GPU. However, the printed "loss" is "nan" in every iteration, and even after PSNR training "successfully" finished, loss_D and loss_G in ESRGAN training are also shown as "nan".

In PSNR training:

...
Training [>> ] 20004/600000, loss=nan, lr=2.0e-04 2.0 step/sec
...

In ESRGAN training:

...
Training [>>> ] 40000/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
[*] save ckpt file at ./checkpoints/esrgan/ckpt-32
Training [>>>> ] 47877/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
...
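The excerpts above only show a few late steps, so it may help to pin down the first step at which a loss stops being finite. The following is a minimal sketch that scans saved log lines for that step. The regular expressions are assumptions derived from the log lines quoted above, and `parse_progress` and `first_non_finite` are hypothetical helper names, not part of esrgan-tf2.

```python
import math
import re

# Patterns assumed from the log lines quoted in this issue, e.g.
# "Training [>> ] 20004/600000, loss=nan, lr=2.0e-04 2.0 step/sec"
STEP_RE = re.compile(r"(\d+)/(\d+),")
LOSS_RE = re.compile(r"(loss\w*)=([^\s,]+)")


def parse_progress(line):
    """Return (step, {loss_name: value}), or None if the line has no step counter."""
    step_match = STEP_RE.search(line)
    if step_match is None:
        return None
    # float("nan") and float("inf") parse fine, so non-finite values survive.
    losses = {name: float(value) for name, value in LOSS_RE.findall(line)}
    return int(step_match.group(1)), losses


def first_non_finite(lines):
    """Return (step, loss_name) for the earliest non-finite loss, else None."""
    for line in lines:
        parsed = parse_progress(line)
        if parsed is None:
            continue
        step, losses = parsed
        for name, value in losses.items():
            if not math.isfinite(value):
                return step, name
    return None


log = ["Training [>> ] 20004/600000, loss=nan, lr=2.0e-04 2.0 step/sec"]
print(first_non_finite(log))  # (20004, 'loss')
```

If the very first logged step is already non-finite, the input data or the initial forward pass is a more likely place to look than the optimizer settings.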

Do you have any suggestions on this issue?

Here are the PSNR and ESRGAN parameter files:

psnr.yaml:

batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'psnr_pretrain'
pretrain_name: null

network_G:
  nf: 64
  nb: 23

train_dataset:
  path: '/data/EOSC/EOSC_sub_bin.tfrecord'
  num_samples: 1825587
  using_bin: True
  using_flip: True
  using_rot: True
test_dataset:
  EOSC_path: '/data2/EOSC_test'

niter: 600000
lr: !!float 2e-4
lr_steps: [200000, 300000, 400000, 500000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99

w_pixel: 1.0
pixel_criterion: l1
save_steps: 20000
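To rule the schedule in or out as a cause, the learning rate implied by `lr`, `lr_steps` and `lr_rate` can be computed for any step. This is a sketch only: it assumes a multi-step decay that multiplies the rate by `lr_rate` each time a boundary in `lr_steps` is reached, which is what the parameter names suggest but is not confirmed here against the esrgan-tf2 code. `lr_at_step` is a hypothetical helper.

```python
import bisect


def lr_at_step(step, base_lr, lr_steps, lr_rate):
    """Multi-step decay: multiply base_lr by lr_rate once per boundary reached."""
    # Assumption: a boundary takes effect from step == boundary onwards.
    boundaries_reached = bisect.bisect_right(lr_steps, step)
    return base_lr * lr_rate ** boundaries_reached


psnr_steps = [200000, 300000, 400000, 500000]
print(f"{lr_at_step(20004, 2e-4, psnr_steps, 0.5):.1e}")  # 2.0e-04
```

Under that assumption the value at step 20004 matches the `lr=2.0e-04` printed in the PSNR log, so the NaN appears while the rate is still at its initial value, before any decay.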

esrgan.yaml:

batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'esrgan'
pretrain_name: 'psnr_pretrain'

network_G:
  nf: 64
  nb: 23
network_D:
  nf: 64

train_dataset:
  path: '/data/EOSC/EOSC_sub_bin.tfrecord'
  num_samples: 1825587
  using_bin: True
  using_flip: False
  using_rot: False
test_dataset:
  EOSC_path: '/data2/EOSC_test'

niter: 285240
lr_G: !!float 1e-4
lr_D: !!float 1e-4
lr_steps: [60000, 120000, 180000, 240000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99
adam_beta1_D: 0.9
adam_beta2_D: 0.99

w_pixel: !!float 1e-2
pixel_criterion: l1

w_feature: 1.0
feature_criterion: l1

w_gan: !!float 5e-3
gan_type: ragan # gan | ragan

save_steps: 20000
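As a quick consistency check on the two parameter files, the arithmetic below relates `niter` to the dataset size. It is a sketch under one assumption: that an epoch consists of full batches only (the last partial batch is dropped), which is not confirmed for esrgan-tf2. The helper names are hypothetical.

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of full batches in one pass over the data (remainder dropped)."""
    return num_samples // batch_size


def epochs_covered(niter, num_samples, batch_size):
    """How many passes over the data `niter` iterations correspond to."""
    return niter / steps_per_epoch(num_samples, batch_size)


num_samples, batch_size = 1_825_587, 64
input_size, gt_size, scale = 32, 128, 4

# The LR patch size times the scale factor should equal the HR patch size.
assert input_size * scale == gt_size

print(steps_per_epoch(num_samples, batch_size))          # 28524
print(epochs_covered(285_240, num_samples, batch_size))  # 10.0
print(round(epochs_covered(600_000, num_samples, batch_size), 2))  # 21.03
```

So the ESRGAN run of 285240 iterations corresponds to 10 passes over the data and the PSNR run to about 21, and the patch sizes agree with the 4x scale; nothing in these numbers is inconsistent on its face.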

Any help would be much appreciated! Thank you!

taoyu17, Oct 27 '20 15:10

Did you solve it?

Lfywx, Apr 11 '22 07:04