
The stability of training

liuchanglab opened this issue 2 years ago · 2 comments

We are training on our own dataset, which contains only faces. However, when we train the LDM, the loss does not decrease; it stays around 0.798:

```
Epoch 5:  8%|▊ | 107-113/1422 [00:57-01:00<~11:35, 1.89it/s], loss=0.798, v_num=0
global_step=6675  lr_abs=0.0032   train/loss_simple_step=0.800  train/loss_vlb_step=0.0366   train/loss_step=0.800
global_step=6676  lr_abs=0.0032   train/loss_simple_step=0.799  train/loss_vlb_step=0.00406  train/loss_step=0.799
global_step=6677  lr_abs=0.0032   train/loss_simple_step=0.799  train/loss_vlb_step=0.0137   train/loss_step=0.799
global_step=6678  lr_abs=0.00321  train/loss_simple_step=0.795  train/loss_vlb_step=0.00617  train/loss_step=0.795
global_step=6679  lr_abs=0.00321  train/loss_simple_step=0.798  train/loss_vlb_step=0.00474  train/loss_step=0.798
global_step=6680  lr_abs=0.00321  train/loss_simple_step=0.796  train/loss_vlb_step=0.00421  train/loss_step=0.796
global_step=6681  lr_abs=0.00321  train/loss_simple_step=0.797  train/loss_vlb_step=0.00464  train/loss_step=0.797
global_step=6682  lr_abs=0.00321  train/loss_simple_step=0.794  train/loss_vlb_step=0.00373  train/loss_step=0.794
epoch so far:     train/loss_simple_epoch=0.748  train/loss_vlb_epoch=0.0059  train/loss_epoch=0.748
```
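
For context, here is a minimal sketch of the ε-prediction objective that `train/loss_simple` tracks (the real implementation lives in `ldm/models/diffusion/ddpm.py`; the model call signature and schedule tensors below are assumptions, not code quoted from the repo):

```python
import torch
import torch.nn.functional as F

# Sketch of the DDPM "simple" objective (Ho et al., 2020) that
# train/loss_simple reports. `model`, `sqrt_alphas_cumprod`, and
# `sqrt_one_minus_alphas_cumprod` are assumed to come from the LDM
# config / noise schedule; they are not quoted from the repo.
def loss_simple(model, x0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
    noise = torch.randn_like(x0)  # eps ~ N(0, I)
    # Forward-diffuse x0 to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    x_t = (sqrt_alphas_cumprod[t].view(-1, 1, 1, 1) * x0
           + sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1) * noise)
    eps_pred = model(x_t, t)  # the UNet predicts the added noise
    return F.mse_loss(eps_pred, noise)
```

Since the target noise has unit variance per element, a network whose output stays close to zero lands at an MSE near 1.0, so a plateau in that neighborhood suggests the prediction still carries little information about the noise.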

Thanks for any comments.

liuchanglab · Mar 18 '22

Hello, I ran into almost the same problem when training on ImageNet (the loss stays at 1.00). Do you have a solution yet? Thank you very much.

sunsq-blue · Apr 17 '22

I changed the optimizer from AdamW to Adam, which alleviated the training instability. I hope it helps you.
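
For anyone trying the same change, here is a minimal sketch of the swap, assuming the optimizer is built in a Lightning-style `configure_optimizers` as in latent-diffusion's `ddpm.py` (the exact surrounding code may differ in your checkout):

```python
import torch

# Sketch of the AdamW -> Adam swap inside a Lightning-style
# configure_optimizers; the enclosing LatentDiffusion class is assumed,
# not quoted verbatim from ldm/models/diffusion/ddpm.py.
def configure_optimizers(self):
    lr = self.learning_rate
    params = list(self.model.parameters())
    # opt = torch.optim.AdamW(params, lr=lr)   # original optimizer
    opt = torch.optim.Adam(params, lr=lr)      # plain Adam, no decoupled weight decay
    return opt
```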

liuchanglab · Apr 18 '22