Question about training stability
Hello, thank you so much for this wonderful paper and codebase. I am trying to reproduce the results of lsun_churches-ldm-kl-8.yaml. I have not modified any parameters in the config, and I am using your pretrained first-stage model.
However, training does not appear to be working correctly: the losses are not decreasing as expected.
My loss curves are below:
Do you know what might be going wrong here? I feel like I have done something incorrectly, but I believe that I followed the instructions closely.
Thank you for your help!
Update: Training nan'ed out after 20 epochs:
Output log:
...
Epoch 15, global step 3375: val/loss_simple_ema was not in top 3
Epoch 16: 100%|█| 220/220 [13:47<00:00, 3.74s/it, loss=0.798, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00455, train/los
Average Epoch time: 827.48 seconds
Average Peak memory 31916.12MiB
Epoch 16, global step 3586: val/loss_simple_ema was not in top 3
Epoch 17: 100%|█| 220/220 [13:47<00:00, 3.74s/it, loss=0.797, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00526, train/los
Average Epoch time: 827.51 seconds
Average Peak memory 31916.25MiB
Epoch 17, global step 3797: val/loss_simple_ema was not in top 3
Epoch 18: 100%|█| 220/220 [13:48<00:00, 3.75s/it, loss=0.797, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.00582, train/los
Average Epoch time: 828.39 seconds
Average Peak memory 31916.12MiB
Epoch 18, global step 4008: val/loss_simple_ema was not in top 3
Epoch 19: 100%|█| 220/220 [13:47<00:00, 3.75s/it, loss=0.798, v_num=0, train/loss_simple_step=0.800, train/loss_vlb_step=0.00434, train/los
Average Epoch time: 827.89 seconds
Average Peak memory 31916.25MiB
Epoch 19, global step 4219: val/loss_simple_ema was not in top 3
Epoch 20: 51%|▌| 112/220 [07:00<06:41, 3.72s/it, loss=nan, v_num=0, train/loss_simple_step=inf.0, train/loss_vlb_step=inf.0, train/loss_ste
Something is definitely wrong.
If you could post the log from one of your training runs, that would be very helpful for comparison.
Hello, I ran into almost the same problem when training on ImageNet (the loss value stays at 1.00). Do you have a solution now? Thank you very much.
Hi,
sorry for the late reply. I will take a guess and suggest to run the training with the following command:
python main.py --base configs/latent-diffusion/lsun_churches-ldm-kl-8.yaml -t --gpus <your gpus> --scale_lr False
This prevents scaling the learning rate by the effective batch size. If this scaling is not deactivated with --scale_lr False, the learning rate is likely too high and training will collapse. Let me know if this does not solve the issue!
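For reference, here is a minimal sketch of the effective-batch-size scaling that --scale_lr controls; the variable names and the base_lr/batch_size values are illustrative, not copied from main.py:

# Minimal sketch (assumed values) of how the learning rate is scaled when
# --scale_lr is left enabled; with --scale_lr False the base value is used as-is.
scale_lr = True            # default behaviour of the training script
base_lr = 5.0e-5           # base_learning_rate from the config (illustrative value)
batch_size = 96            # per-GPU batch size from the config (illustrative value)
ngpu = 8                   # number of GPUs passed via --gpus
accumulate_grad_batches = 1

if scale_lr:
    # LR grows with the effective batch size, which can push it high enough to collapse training
    learning_rate = accumulate_grad_batches * ngpu * batch_size * base_lr
else:
    learning_rate = base_lr

print(f"learning rate used for training: {learning_rate:.2e}")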
Thanks, I'll give it a try and report back on whether the instability persists with --scale_lr False.
Were you able to train the model?
Reducing the learning rate may solve the issue.
@sunsq-blue Actually, I had a different question. I thought that if someone has trained successfully, they could help. I am not able to sample from the trained model. I detailed the issue here: https://github.com/CompVis/latent-diffusion/issues/97
@naveedunjum I have trained on LSUN bedrooms, even with segmentation. The repo may be missing some configuration files; you can fill them in yourself.
@sunsq-blue Could you please post your replies on https://github.com/CompVis/latent-diffusion/issues/97?
@rromb Running the 4-GPU FFHQ config with --scale_lr False, the loss still explodes after 16k steps.
Hi guys,
I want to share this info since it may be useful for others.
Initially, when I trained the model on the churches dataset, I got random and black images, which indicates that the model didn't learn anything.
The solution proposed by @rromb, deactivating the learning rate scaling with --scale_lr False, works for me: I now get good images after 70 epochs and training is ongoing.
I encountered the same issue: at the beginning of training the loss gradually decreased, but it then reached a point where training seemed to diverge and the loss started increasing. Since I'm using a custom dataset and some non-default parameters, I suspected an issue with the model's architecture or the learning rate. Reducing the learning rate twice, to roughly one-tenth of the initial value, resolved the problem, although training then became somewhat slow.
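For anyone who wants to try the same fix, here is a small sketch of cutting the learning rate to one-tenth by writing a modified copy of a config. This assumes the OmegaConf package and the key layout of the stock configs; the output filename is just an example:

# Sketch (not an official workflow): save a copy of the config with the
# learning rate reduced to one-tenth, then point main.py --base at the copy.
from omegaconf import OmegaConf

cfg_path = "configs/latent-diffusion/lsun_churches-ldm-kl-8.yaml"
cfg = OmegaConf.load(cfg_path)

# The key name follows the stock configs; adjust it if your custom config differs.
cfg.model.base_learning_rate = cfg.model.base_learning_rate * 0.1

OmegaConf.save(cfg, "configs/latent-diffusion/lsun_churches-ldm-kl-8_lowlr.yaml")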