esrgan-tf2
Loss is always nan
Hello PeteryuX,
Thanks a lot for sharing your implementation of ESRGAN.
I have recently been testing several GAN-based super-resolution networks. I have a large set of HR/LR training images and would like to train ESRGAN (PSNR pretraining followed by ESRGAN) using your training code.
I followed your data preparation instructions and converted my 1,825,587 LR/HR pairs to *bin.tfrecord. I checked the result with dataset_checker and found no problems: the LR/HR images displayed correctly. I then modified a few lines of your code (hardcoded paths etc.) and started PSNR training on an RTX 3090 GPU. However, the printed "loss" is "nan" in every iteration, and even after PSNR training "successfully" finished, loss_D and loss_G in ESRGAN training are also shown as "nan".
In PSNR training:
...
Training [>> ] 20004/600000, loss=nan, lr=2.0e-04 2.0 step/sec
...
In ESRGAN training:
...
Training [>>> ] 40000/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
[*] save ckpt file at ./checkpoints/esrgan/ckpt-32
Training [>>>> ] 47877/285240, loss_G=nan, loss_D=nan, lr_G=1.0e-04, lr_D=1.0e-04 1.4 step/sec
...
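For reference, here is a minimal sketch of why a single non-finite value would be enough to produce this output, assuming the `l1` pixel criterion from my config is a plain mean absolute error. The helpers `l1_loss` and `find_non_finite` are hypothetical illustrations, not functions from the repository.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error over two equal-length flat lists of pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def find_non_finite(values):
    """Return the indices of NaN or infinite entries in a flat list."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]

target = [0.0, 0.5, 1.0, 0.3]
clean = [0.1, 0.5, 0.9, 0.3]
poisoned = [0.1, math.nan, 0.9, 0.3]  # one bad pixel (illustrative data)

clean_loss = l1_loss(clean, target)        # finite
poisoned_loss = l1_loss(poisoned, target)  # NaN: the bad pixel poisons the mean
bad_indices = find_non_finite(poisoned)    # positions of the non-finite values
```

This does not show where the NaN comes from in my run. It only illustrates that one non-finite value in a batch or in the network output would make the averaged loss print as nan.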
Do you have any suggestions on this issue?
I attach the PSNR and ESRGAN parameter files here:
psnr.yaml:

batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'psnr_pretrain'
pretrain_name: null

network_G:
    nf: 64
    nb: 23

train_dataset:
    path: '/data/EOSC/EOSC_sub_bin.tfrecord'
    num_samples: 1825587
    using_bin: True
    using_flip: True
    using_rot: True
test_dataset:
    EOSC_path: '/data2/EOSC_test'

niter: 600000
lr: !!float 2e-4
lr_steps: [200000, 300000, 400000, 500000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99

w_pixel: 1.0
pixel_criterion: l1
save_steps: 20000
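To make the schedule settings above concrete, here is a minimal sketch of how I read `lr`, `lr_steps` and `lr_rate`: the rate is multiplied by `lr_rate` once for every boundary in `lr_steps` that has been passed. The function `multistep_lr` is a hypothetical helper, and whether the boundary step itself already uses the reduced rate is an assumption on my part.

```python
def multistep_lr(step, base_lr, lr_steps, lr_rate):
    """Learning rate after applying lr_rate once per boundary already reached."""
    passed = sum(1 for boundary in lr_steps if step >= boundary)
    return base_lr * (lr_rate ** passed)

psnr_steps = [200000, 300000, 400000, 500000]

# Step 20004 is before the first boundary, so the base rate still applies.
lr_at_logged_step = multistep_lr(20004, 2e-4, psnr_steps, 0.5)

# After all four boundaries the rate has been halved four times.
lr_at_end = multistep_lr(599999, 2e-4, psnr_steps, 0.5)
```

Under this reading, the rate at step 20004 is still 2e-4, which matches the `lr=2.0e-04` printed in the PSNR training log, so the learning rate had not been changed by the schedule when the nan values appeared.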
esrgan.yaml:

batch_size: 64
input_size: 32
gt_size: 128
ch_size: 3
scale: 4
sub_name: 'esrgan'
pretrain_name: 'psnr_pretrain'

network_G:
    nf: 64
    nb: 23
network_D:
    nf: 64

train_dataset:
    path: '/data/EOSC/EOSC_sub_bin.tfrecord'
    num_samples: 1825587
    using_bin: True
    using_flip: False
    using_rot: False
test_dataset:
    EOSC_path: '/data2/EOSC_test'

niter: 285240
lr_G: !!float 1e-4
lr_D: !!float 1e-4
lr_steps: [60000, 120000, 180000, 240000]
lr_rate: 0.5

adam_beta1_G: 0.9
adam_beta2_G: 0.99
adam_beta1_D: 0.9
adam_beta2_D: 0.99

w_pixel: !!float 1e-2
pixel_criterion: l1

w_feature: 1.0
feature_criterion: l1

w_gan: !!float 5e-3
gan_type: ragan  # gan | ragan

save_steps: 20000
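For scale, the iteration counts in the two files can be related to the dataset size. This is a minimal sketch of the arithmetic, assuming an epoch is counted in full batches with the remainder dropped. The variable names are only for illustration.

```python
num_samples = 1825587
batch_size = 64

# Full batches per pass over the data (assumption: the last partial batch is dropped).
steps_per_epoch = num_samples // batch_size

psnr_epochs = 600000 / steps_per_epoch    # niter from psnr.yaml
esrgan_epochs = 285240 / steps_per_epoch  # niter from esrgan.yaml
```

With these numbers one pass is 28,524 steps, so the ESRGAN run covers 10 passes over the data and the PSNR run about 21.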
Any help would be much appreciated! Thank you!
Did you solve it?