latent-diffusion
Reproduction problem while training inpainting model
Thanks for the good work. I am trying to reproduce the diffusion model on the image inpainting task. The configuration file I use is modified from models/ldm/inpainting_big/config.yaml. But the loss curve appears quite weird: it converges too fast, right after the warmup ends.
(Note that the warmup is 1000 steps, and the loss has already reached a pretty low value at step 1000.)
Also, the inpainting results are poor in quality. This is one of my own test images (trained on the FFHQ dataset).

Has anyone encountered the same problem? I feel like this might be caused by a learning rate issue. Please let me know if you have any idea how to fix this. Thank you very much!
Could you give me an example of your dataloader? I am using the same config file, passing the masked image as 3 channels and the image as 3 channels, but I am getting this error:
RuntimeError: Given groups=1, weight of size [256, 7, 3, 3], expected input[4, 6, 64, 64] to have 7 channels, but got 6 channels instead
You need to modify the corresponding part in ddpm.py. I solved the problem by concatenating mask, masked_image and image; then the input has 7 channels, as the configuration file specifies. However, it still seems hard to reproduce the official results. I have no idea how long the authors trained the model for. I have trained for 3 full days and the inpainted results are still blurry.
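For reference, here is a minimal sketch of the kind of dataset I mean. It mirrors the keys and value ranges that make_batch() in scripts/inpaint.py produces, but returns HxWxC arrays so that get_input() in ldm/models/diffusion/ddpm.py can rearrange them to BxCxHxW. The class name, paths and mask_fn are placeholders, not code from the repo:

```python
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class InpaintingDataset(Dataset):
    """Illustrative dataset returning the same keys and value ranges as
    make_batch() in scripts/inpaint.py, but as HxWxC arrays for training."""
    def __init__(self, image_paths, mask_fn, size=256):
        self.image_paths = image_paths
        self.mask_fn = mask_fn  # callable (h, w) -> HxWx1 float mask, 1 = hole
        self.size = size

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, i):
        image = Image.open(self.image_paths[i]).convert("RGB").resize((self.size, self.size))
        image01 = np.array(image).astype(np.float32) / 255.0          # HxWx3 in [0, 1]
        mask = self.mask_fn(self.size, self.size).astype(np.float32)  # HxWx1, values in {0, 1}
        masked01 = image01 * (1.0 - mask)                             # zero out the hole region
        # scripts/inpaint.py rescales image, masked_image AND mask to [-1, 1]
        return {"image": image01 * 2.0 - 1.0,
                "masked_image": masked01 * 2.0 - 1.0,
                "mask": mask * 2.0 - 1.0}
```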
Do you mean that in the dataloader, masked_image should contain masked_image, mask and image? Or do you mean in ddpm.py? Could you specify where, please?
I fixed my problem, and after training I got the same output as you, just noise in the masked parts.
Hi guys, same problem here.
Hi, could you tell me where to change? Thx a lot!
I suggest you spend some time understanding the whole codebase. It would be a lot easier if you understood how Stable Diffusion works and how this process is implemented here, although it might take a while.
Generally speaking, you could modify ldm/models/diffusion/ddpm.py according to the script scripts/inpaint.py, which is used at inference time. In inpaint.py, line 79, we can see that the mask in each batch is downsampled to the same spatial size as the masked image's encoding from the VQ model. This is implemented with torch.nn.functional.interpolate(), and the downsampled mask is then concatenated with the encoded masked image. We should keep the same way of adding the mask while training. So the total number of input channels of the U-Net should be 7 (image, 3 channels + masked image, 3 channels + mask, 1 channel = 7 channels), and the mask and masked image should be concatenated with the input image in the same way as at inference. Modifying the corresponding lines in ddpm.py in this manner should work.
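Concretely, the inference-time lines in scripts/inpaint.py look roughly like the snippet below; the training-side change just has to build the same 7-channel input. Where exactly to place it in ddpm.py (e.g. inside get_input()) is my own assumption, not the authors' code:

```python
import torch

# scripts/inpaint.py (around line 79): build the conditioning
c = model.cond_stage_model.encode(batch["masked_image"])   # encode masked image -> B x 3 x h x w
cc = torch.nn.functional.interpolate(batch["mask"],
                                     size=c.shape[-2:])    # downsample mask to latent size
c = torch.cat((c, cc), dim=1)                              # B x 4 x h x w conditioning

# At training time (e.g. inside get_input() in ldm/models/diffusion/ddpm.py, my assumption),
# this 4-channel conditioning gets concatenated with the 3-channel noisy image latent,
# giving the 7 input channels expected by models/ldm/inpainting_big/config.yaml.
```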
Thanks for your reply, I've solved the problem already.
Hi @AlonzoLeeeooo,
Any update on your progress? Were you able to achieve good inpainting results on your custom dataset? If so, it would be great if you could share your training pipeline/configurations.
Hi @zaryabmakram,
I didn't manage to re-train the diffusion model successfully. The results are always blurry, even after the model has been trained for 3 days; empirically this is caused by insufficient training. Afterwards, I noticed that the reported GPU requirement in the supplementary materials of Stable Diffusion is 8 V100 GPUs. Due to limited computational resources, I had to give up on reproducing it.
How about finetuning the provided inpainting_big checkpoint instead of training from scratch? Have you experimented with that? Do you think that might output good results on a custom dataset?
Also, are you aware of which dataset inpainting_big checkpoint has been trained on?
I haven't tried finetuning yet, but the idea should work, at least in theory. The reported training set is Places2 Standard. It is worth mentioning that the provided inpainting_big checkpoint is able to produce plausible results on most natural images. Maybe you could try it out.
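In case it helps, loading the released checkpoint follows the same pattern as scripts/inpaint.py (the paths below assume last.ckpt has been downloaded into models/ldm/inpainting_big/):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the model from the inpainting config and load the released weights
config = OmegaConf.load("models/ldm/inpainting_big/config.yaml")
model = instantiate_from_config(config.model)
state = torch.load("models/ldm/inpainting_big/last.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(state, strict=False)
model = model.cuda().eval()  # keep it trainable instead if you want to finetune from these weights
```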
I see, thanks! Well, I'll look into how I can try finetuning the inpainting checkpoint.
Can you kindly point me to the reference reporting that the Places2 Standard dataset was used for training the inpainting model? I'm unable to find it.
It is in Table 15 of their supplementary materials, which you can find at https://openaccess.thecvf.com/content/CVPR2022/supplemental/Rombach_High-Resolution_Image_Synthesis_CVPR_2022_supplemental.pdf.
Could you please tell me what encoder you use as cond_stage_config for training the inpainting model?
Hi! I'm very interested in your training details. You trained for 3 days; what batch size did you use, how much GPU memory did it take in total, and how many epochs did you train? I haven't tried to reproduce it yet, but I'm curious about the claim in the latent-diffusion paper that it "reduces memory overhead": can working in the latent space really reduce memory usage that much? Looking forward to your reply!
Hi @DongyangHuLi,
I set the batch size to 48, but to save GPU memory I reduced the model channels to 128. I trained for about 600k iterations on two 3090s. As for the "reduced memory overhead" claim of latent diffusion, it is relative to DDPM; in practice the GPU requirement is still considerable, and for the inpainting setting you still need 8 Tesla V100s for training. So my own compute is still not enough to reproduce the results of the original paper. Hope this helps.
Thanks! So without enough hardware resources, it's really hard to make diffusion models work 😔
Yes, it's basically research you can only do by burning money 🙍♂️
Could you share the inpainting training code? Thank you!
Bro, are you also at USTC? Want to add each other on WeChat to chat? My WeChat ID is Kiss_The_Rain8; please add me, I'd like to ask you a few questions.
Hi @AlonzoLeeeooo,
Do I need to finetune the autoencoder separately (stage 1) on my custom dataset and then finetune the inpainting_big model, modifying the input in ddpm.py as in inpaint.py (stage 2), on my custom dataset? Or would stage 2 alone work? Please help.
Hi @rayush7, as far as I am concerned, you don't need to tune the parameters of the VQ model (stage 1). Since the official one is trained on the Open Images dataset, it should be sufficient to encode most images. Finetuning only stage 2 should work.
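As a rough sketch of what "finetune only stage 2" means in code (attribute names follow ldm/models/diffusion/ddpm.py; the learning rate is a hypothetical value, and LatentDiffusion already keeps the first stage frozen, so the first loop is mostly a sanity check):

```python
import torch

# Freeze the VQ autoencoder (stage 1); it only encodes/decodes latents.
for p in model.first_stage_model.parameters():
    p.requires_grad = False

# Optimize only the diffusion model (stage 2), i.e. the U-Net wrapped in model.model.
trainable = [p for p in model.model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1.0e-6)  # hypothetical finetuning learning rate
```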
Thank you @AlonzoLeeeooo I will give it a try.
How do I prepare the data for inpainting?
The paper mentions that the data preparation step is the same as in LaMa: https://github.com/advimman/lama
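For illustration only, a thick-stroke mask generator loosely in the spirit of LaMa's training masks could look like the sketch below; the actual generation code and its parameters live in the LaMa repository:

```python
import numpy as np
import cv2

def random_stroke_mask(h=512, w=512, max_strokes=5, max_width=60, rng=None):
    """Illustrative irregular mask (HxWx1, 1 = hole), NOT the actual LaMa code."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), np.float32)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        # start each stroke at a random point and draw a few connected thick segments
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h))
        for _ in range(int(rng.integers(1, 10))):
            angle = rng.uniform(0, 2 * np.pi)
            length = int(rng.integers(10, 100))
            x2 = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            y2 = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (x2, y2), 1.0, thickness=int(rng.integers(10, max_width)))
            x, y = x2, y2
    return mask[..., None]
```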
Could I have a look at your modified code for this part? Thank you very much if I could!
Hi @mumingerlai,
I really would like to help, but since I was working on another project based on the same codebase, I have made a huge number of modifications to it, and it would be quite difficult for me to retrieve the parts corresponding to inpainting. Sorry I can't do you the favor.
But if there are any other problems with the modification, please feel free to discuss them in this issue and I will try my best to recall the details and answer.
Regards, Chang
I'm very happy about your reply. I have modified inpainting.py and concatenated images, masked images, and masks. It seems to run normally! Anyway, thank you very much!
Hello, could I have a look at your modification and data config for inpainting training? These tasks are difficult for me. Thank you very much!