
Question about training stability

Open greeneggsandyaml opened this issue 2 years ago • 14 comments

Hello, thank you so much for this wonderful paper and codebase. I am trying to reproduce the results of lsun_churches-ldm-kl-8.yaml. I have not modified any parameters in the config and I am using your pretrained first stage model.

However, some part of training is not working correctly -- the losses are not decreasing as expected.

My loss curves are below: [image: loss curves]

Do you know what might be going wrong here? I feel like I have done something incorrectly, but I believe that I followed the instructions closely.

Thank you for your help!

greeneggsandyaml avatar Jan 18 '22 22:01 greeneggsandyaml

Update: Training nan'ed out after 20 epochs:

Output log:

...
Epoch 15, global step 3375: val/loss_simple_ema was not in top 3
Epoch 16: 100%|█| 220/220 [13:47<00:00,  3.74s/it, loss=0.798, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00455, train/losAverage Epoch time: 827.48 seconds
Average Peak memory 31916.12MiB
Epoch 16, global step 3586: val/loss_simple_ema was not in top 3
Epoch 17: 100%|█| 220/220 [13:47<00:00,  3.74s/it, loss=0.797, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00526, train/losAverage Epoch time: 827.51 seconds
Average Peak memory 31916.25MiB
Epoch 17, global step 3797: val/loss_simple_ema was not in top 3
Epoch 18: 100%|█| 220/220 [13:48<00:00,  3.75s/it, loss=0.797, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.00582, train/losAverage Epoch time: 828.39 seconds
Average Peak memory 31916.12MiB
Epoch 18, global step 4008: val/loss_simple_ema was not in top 3
Epoch 19: 100%|█| 220/220 [13:47<00:00,  3.75s/it, loss=0.798, v_num=0, train/loss_simple_step=0.800, train/loss_vlb_step=0.00434, train/losAverage Epoch time: 827.89 seconds
Average Peak memory 31916.25MiB
Epoch 19, global step 4219: val/loss_simple_ema was not in top 3
Epoch 20:  51%|▌| 112/220 [07:00<06:41,  3.72s/it, loss=nan, v_num=0, train/loss_simple_step=inf.0, train/loss_vlb_step=inf.0, train/loss_ste

Something is definitely wrong.

greeneggsandyaml avatar Jan 18 '22 22:01 greeneggsandyaml

If you could potentially post the log for one of the training runs, that would be very helpful for comparing results.

greeneggsandyaml avatar Jan 19 '22 03:01 greeneggsandyaml

Hello, I ran into almost the same problem when training on ImageNet (the loss stays around 1.00). Do you have a solution now? Thank you very much.

sunsq-blue avatar Apr 17 '22 07:04 sunsq-blue

Hi,

sorry for the late reply. I will take a guess and suggest running the training with the following command:

python main.py --base configs/latent-diffusion/lsun_churches-ldm-kl-8.yaml -t --gpus <your gpus> --scale_lr False

This prevents scaling the learning rate by the effective batch size. If this scaling is not deactivated with --scale_lr False, the lr is likely too high and the training will collapse. Let me know if this does not solve the issue!
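For context, here is a rough sketch of the scaling in question (the variable names and example values are illustrative assumptions, not the repo's exact code): the effective learning rate grows with the number of GPUs, the per-GPU batch size, and gradient accumulation, which can easily push it into an unstable range.

```python
# Sketch of the learning-rate scaling applied when --scale_lr is left enabled.
# The example values are placeholders, not the values from the config.
base_lr = 5.0e-5              # model.base_learning_rate from the config
ngpu = 8                      # number of GPUs passed via --gpus
batch_size = 96               # data.params.batch_size from the config
accumulate_grad_batches = 1   # gradient accumulation steps

# With scaling enabled: the lr grows with the effective batch size.
scaled_lr = accumulate_grad_batches * ngpu * batch_size * base_lr

# With --scale_lr False: the model trains with base_lr directly.
print(f"scaled lr: {scaled_lr:.2e}  vs. unscaled base lr: {base_lr:.2e}")
```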

rromb avatar May 30 '22 21:05 rromb

Thanks, I'll give it a try and report back on whether the instability persists with --scale_lr False.

greeneggsandyaml avatar May 30 '22 21:05 greeneggsandyaml

Were you able to train the model?

naveedunjum avatar Jul 07 '22 12:07 naveedunjum

Reducing the learning rate may solve the issue.

sunsq-blue avatar Jul 07 '22 12:07 sunsq-blue

@naveedunjum

sunsq-blue avatar Jul 07 '22 12:07 sunsq-blue

@sunsq-blue Actually, I had a different question. I thought that maybe someone who has trained successfully could help. I am not able to sample from the trained model. I detailed the issue here: https://github.com/CompVis/latent-diffusion/issues/97

naveedunjum avatar Jul 07 '22 12:07 naveedunjum

@naveedunjum I have trained on LSUN bedrooms, even with segmentation. The repo may be missing some configuration files; you can complete them yourself.

sunsq-blue avatar Jul 07 '22 12:07 sunsq-blue

@sunsq-blue can you please post replies on https://github.com/CompVis/latent-diffusion/issues/97.

naveedunjum avatar Jul 07 '22 12:07 naveedunjum

@rromb Running the 4-GPU FFHQ config with --scale_lr False, the loss explodes after 16k steps: [image: loss curve]

Ir1d avatar Aug 30 '22 05:08 Ir1d

Hi guys, I want to share this info since it may be useful for others. Initially, when I trained the model on the churches dataset, I got random and black images, which indicates that the model didn't learn anything. The solution proposed by @rromb, deactivating the learning rate scaling with --scale_lr False, works for me: I now get good images after 70 epochs, and training is ongoing.

eslambakr avatar Jun 10 '23 10:06 eslambakr

I encountered the same issue: at the beginning of training the loss gradually decreased, but then training seemed to diverge and the loss started increasing. Since I'm using a custom dataset and some specific parameters, I suspected an issue with the model's architecture or the learning rate. I reduced the learning rate twice, down to approximately one-tenth of the initial value, and that resolved the problem. However, training then became a bit slow.
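If someone wants to try the same reduction, one way is to write out a modified copy of the config. This is only a sketch: it assumes the config exposes a `model.base_learning_rate` key, as the configs shipped with this repo do, and the output filename is made up.

```python
# Sketch: write a copy of the config with base_learning_rate lowered 10x.
# Assumes a `model.base_learning_rate` key; the output path is hypothetical.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/latent-diffusion/lsun_churches-ldm-kl-8.yaml")
cfg.model.base_learning_rate = cfg.model.base_learning_rate / 10.0
OmegaConf.save(config=cfg, f="configs/latent-diffusion/lsun_churches-ldm-kl-8-lowlr.yaml")
```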

zhixiaoni avatar Aug 13 '23 15:08 zhixiaoni