
Question about --scale_lr

Open ader47 opened this issue 1 year ago • 20 comments

Hi, I encountered some problems when training the unconditional LDM on 2 RTX 3090s. When should I use --scale_lr True to scale the learning rate? (It is actually True by default....) The learning rate is scaled as accumulate_grad_batches * ngpu * bs * base_lr. Why should the learning rate be scaled this way? If I use batch size 48, the learning rate becomes 1 * 2 * 48 * 0.00005, much larger than the lr in the paper (0.00005), and the model does not converge. I want to train the model with the paper settings; should I set --scale_lr False?
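
For reference, the lr computation in main.py is roughly the following (a paraphrased sketch from memory, written as a standalone function; names may differ slightly from the actual source):

```python
def effective_lr(base_lr: float, scale_lr: bool, ngpu: int,
                 batch_size: int, accumulate_grad_batches: int = 1) -> float:
    """Mirrors (from memory) the lr setup in main.py: with --scale_lr,
    the base lr is multiplied by the effective batch size."""
    if scale_lr:
        return accumulate_grad_batches * ngpu * batch_size * base_lr
    return base_lr

# My setting: 2 GPUs, per-GPU batch size 48, base lr 5e-5
print(effective_lr(5e-5, True, ngpu=2, batch_size=48))   # ~0.0048
print(effective_lr(5e-5, False, ngpu=2, batch_size=48))  # 5e-05
```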

ader47 avatar Apr 02 '23 18:04 ader47

I have faced the same problem, and I found that the model converges well in my task when scale_lr is False.

Joel18241096 avatar Apr 03 '23 08:04 Joel18241096

Why do you need a batch size as big as 48? I don't think an RTX 3090 has enough memory.

blusque avatar Apr 04 '23 12:04 blusque

I want to train on the LSUN_Churches dataset, and the batch size in the original paper is 96. The maximum batch size I could fit on an RTX 3090 was 52. (With 2 GPUs, a per-GPU batch size of 48 already gives an effective batch of 96, matching the paper.)

ader47 avatar Apr 04 '23 12:04 ader47

Got the same question

haooxia avatar Nov 04 '23 02:11 haooxia

I also ran into a case where the model could not converge. I kept the learning rate constant at 5e-5 and it still did not seem to converge.

clearlyzero avatar Dec 09 '23 15:12 clearlyzero

The loss fluctuates around 0.2.

ader47 avatar Dec 09 '23 15:12 ader47

In my experiment, the loss is around 0.4. Should a loss fluctuating around 0.2 be understood as convergence?

clearlyzero avatar Dec 09 '23 15:12 clearlyzero

Did you keep the provided settings or change them? I kept the settings and the loss was around 0.2, but I forget on which dataset. In my experiments, only LSUN_Churches needed scale_lr set to False; the others can stay True.
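
If I remember the argument parser in main.py correctly, --scale_lr takes a boolean, so disabling it for the churches config should look something like this (paths as in the repo):

```
python main.py --base configs/latent-diffusion/lsun_churches-ldm-kl-8.yaml -t --gpus 0,1 --scale_lr False
```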

ader47 avatar Dec 09 '23 15:12 ader47

If it is set to False, is the lr = n_gpus * 0.00005?

clearlyzero avatar Dec 09 '23 15:12 clearlyzero

No, the lr = 0.00005, and I remember the code sets up a linear lr scheduler, so the lr increases from 0 to 0.00005 over the first 10000 steps and then stays constant.
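
If memory serves, the warm-up comes from the LambdaLinearScheduler configured in the yaml (warm_up_steps: [10000]); a simplified sketch of the multiplier it applies to the base lr:

```python
def warmup_multiplier(step: int, warm_up_steps: int = 10_000,
                      f_start: float = 1e-6, f_max: float = 1.0) -> float:
    """Simplified sketch of ldm.lr_scheduler.LambdaLinearScheduler:
    the lr multiplier ramps linearly from ~0 to 1 over the warm-up
    steps, then stays constant (f_min == f_max == 1 in that config)."""
    if step < warm_up_steps:
        return f_start + (f_max - f_start) * step / warm_up_steps
    return f_max

# effective lr at step t is base_lr * warmup_multiplier(t), e.g.
# 0.00005 * warmup_multiplier(5_000) ~= 0.000025 halfway through warm-up
```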

ader47 avatar Dec 09 '23 15:12 ader47

Thank you for your reply. I have a general understanding now; I will try it later.

clearlyzero avatar Dec 09 '23 15:12 clearlyzero

The FID from the paper could not be reproduced using my own trained ckpt 😭

ader47 avatar Dec 09 '23 15:12 ader47

Is the FID also impossible to reproduce with the provided checkpoints?

clearlyzero avatar Dec 09 '23 16:12 clearlyzero

No, you can reproduce the FID using the provided ckpt, but with my own trained ckpt the FID could not be reproduced.

ader47 avatar Dec 09 '23 16:12 ader47

This is indeed very complicated and difficult. I am still training on a very simple dataset and have not applied it yet.

clearlyzero avatar Dec 09 '23 16:12 clearlyzero

Good luck 👍

ader47 avatar Dec 09 '23 16:12 ader47

May I ask whether the latent space of the AutoencoderKL encoder used for the LSUN_Churches dataset is 32x32x4? I am currently using 64x64x3, and I wonder if that is why my loss is so high.

clearlyzero avatar Dec 10 '23 01:12 clearlyzero

Yes, because it is kl-f8; the f8 means it compresses the spatial size by a factor of 8, so a 256x256 input becomes a 32x32 latent.
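
You can sanity-check the latent shape like this (a sketch; it assumes `autoencoder` is an ldm.models.autoencoder.AutoencoderKL already loaded from the kl-f8 first-stage checkpoint, with the loading code omitted):

```python
import torch

# `autoencoder`: an AutoencoderKL loaded from the kl-f8 checkpoint (assumed).
x = torch.randn(1, 3, 256, 256)     # dummy 256x256 RGB batch
posterior = autoencoder.encode(x)   # returns a DiagonalGaussianDistribution
z = posterior.sample()
print(z.shape)                      # expected: torch.Size([1, 4, 32, 32])
```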

ader47 avatar Dec 10 '23 05:12 ader47

I can now use the encoder to compress the image, then run the diffusion on it and generate images, though the quality is not that good. I use a very small UNet 😂

clearlyzero avatar Dec 11 '23 04:12 clearlyzero

But I think this does not matter, because the higher the compression ratio, the lower the quality of the generated results, and 64x64x3 actually has a lower compression ratio than 32x32x4.
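
(Assuming 256x256x3 inputs: a 64x64x3 latent keeps 12,288 values per image, a 16x compression, while a 32x32x4 latent keeps 4,096 values, a 48x compression.)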

Ly403 avatar Apr 02 '24 14:04 Ly403