
Colorization training isn't working

Open omerb01 opened this issue 1 year ago • 24 comments

I downloaded the Flickr25k dataset, preprocessed it, and trained a model with these modifications in the config file:

  • batch size of 256 per GPU across 4 GPUs (thus a total batch size of 1024)
  • image resolution of 64×64
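For reference, the batch-size override above would sit in the training section of the JSON config, roughly like this (the exact key paths are assumptions based on the repo's config layout, so adjust to your file; the 64×64 resolution goes in the corresponding dataset `args`):

```json
{
  "datasets": {
    "train": {
      "dataloader": {
        "args": {
          "batch_size": 256
        }
      }
    }
  }
}
```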

The rest of the configurations remained as in the current config file. Even after 1000 training epochs, the model still produces bad results.

Is there anything I'm missing? Thanks.

omerb01 avatar Sep 01 '22 06:09 omerb01

I'm also experiencing the same issues. Are your results also very unsaturated?

xenova avatar Sep 02 '22 01:09 xenova

I'm not sure if you have tried this, but what about setting "clip_denoised" to False (instead of True, which is the default)? It might produce more saturated results.

^ I will try this for my task and let you know how it goes
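For context on what that flag does in typical DDPM samplers: `clip_denoised` clamps the model's estimate of the clean image x0 back into the data range [-1, 1] at every reverse step. A minimal numpy sketch of that behavior (assumed typical behavior, not the repo's exact code):

```python
import numpy as np

def predict_x0(x_t, eps_hat, alpha_bar_t, clip_denoised=True):
    """Recover the model's estimate of the clean image x0 from x_t and the
    predicted noise eps_hat, optionally clamping it to the data range [-1, 1].
    Sketch of what a `clip_denoised` flag typically controls in DDPM samplers."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    if clip_denoised:
        x0_hat = np.clip(x0_hat, -1.0, 1.0)
    return x0_hat

# With clipping, extreme estimates are pulled back into [-1, 1];
# without it, the sampler's x0 estimate can drift outside the data range.
x_t = np.array([2.5, -2.5, 0.1])
eps_hat = np.zeros(3)
clipped = predict_x0(x_t, eps_hat, alpha_bar_t=0.25)
unclipped = predict_x0(x_t, eps_hat, alpha_bar_t=0.25, clip_denoised=False)
```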

xenova avatar Sep 02 '22 16:09 xenova

Thanks @xenova , waiting for your update

omerb01 avatar Sep 02 '22 21:09 omerb01

After training for another 3 hours with clip_denoised=False, I haven't seen any improvement. Perhaps @Janspiry can provide some extra assistance.

xenova avatar Sep 02 '22 22:09 xenova

@xenova @omerb01 Hello, did you solve the issue? I am still having problems in colorization task.

ksunho9508 avatar Sep 16 '22 11:09 ksunho9508

@xenova @omerb01 Hello, did you solve the issue? I am still having problems in colorization task.

Nope, still struggling with colorization

xenova avatar Sep 16 '22 11:09 xenova

@ksunho9508 @xenova I am still unable to obtain reliable results. In my opinion, the Flickr dataset does not contain enough data to generalize this task via diffusion-based methods. The authors of the original paper applied their method to the ImageNet dataset, which contains much more training data.

omerb01 avatar Sep 16 '22 12:09 omerb01

Hi guys, sorry about this problem.

Like @omerb01 said, I share the view that the Flickr dataset is too small for colorization of natural scenes. Maybe you should try this task on ImageNet or Places2. More information can be found in #17

Janspiry avatar Sep 16 '22 13:09 Janspiry

@Janspiry I've also tried on my custom dataset (with millions of images), and I get the same results :/ ... I'm really not sure how this is the only task that is facing these issues; all other tasks seem to work fine.

xenova avatar Sep 16 '22 13:09 xenova

@xenova I'll make sure there are no bugs in the colorization part of the code

Janspiry avatar Sep 16 '22 14:09 Janspiry

@Janspiry Thank you. And could you add a config file for super-resolution too?

KSH0660 avatar Sep 16 '22 15:09 KSH0660

I also ran into this problem. I trained it on my own small-scale dataset, but still failed to get good results after many epochs. @Janspiry

edcson avatar Nov 22 '22 06:11 edcson

@omerb01 Have you tried running experiments under the same conditions after changing GroupNorm to BatchNorm? It seems that using BatchNorm instead of GroupNorm lets the model colorize to some extent, distinguishing between the background and objects.
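For intuition on the GroupNorm-vs-BatchNorm suggestion: GroupNorm normalizes over groups of channels within each sample, while BatchNorm normalizes each channel across the whole batch, which couples statistics between images. An illustrative numpy sketch of the two statistics (not the repo's actual layers):

```python
import numpy as np

x = np.random.randn(8, 16, 4, 4)  # (batch, channels, H, W)

# GroupNorm: per-sample statistics over each group of channels.
groups = 4
xg = x.reshape(8, groups, 16 // groups, 4, 4)
gn_mean = xg.mean(axis=(2, 3, 4), keepdims=True)
gn_std = xg.std(axis=(2, 3, 4), keepdims=True)
gn = ((xg - gn_mean) / gn_std).reshape(x.shape)

# BatchNorm: per-channel statistics over the whole batch.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)
bn_std = x.std(axis=(0, 2, 3), keepdims=True)
bn = (x - bn_mean) / bn_std
```

After normalization, each (sample, group) slice of `gn` has zero mean, whereas each channel of `bn` has zero mean only when averaged over the whole batch.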

kkamankun avatar Mar 08 '23 08:03 kkamankun

I experienced the same problem. BTW, have you guys checked the training log? According to mine, it seems that the network suffers from severe overfitting:

```
INFO: Begin model train.
INFO: train/mse_loss: 0.1167483588039875
INFO: train/mse_loss: 0.0724316855113022
INFO: train/mse_loss: 0.06527451830048543
INFO: epoch: 1
INFO: iters: 23488
INFO: train/mse_loss: 0.020401993506137254
INFO: train/mse_loss: 0.018878939009419112
INFO: train/mse_loss: 0.018366146380821978
INFO: epoch: 2
INFO: iters: 46976
INFO: train/mse_loss: 0.014938667484635498
INFO: train/mse_loss: 0.0148746125182753
INFO: train/mse_loss: 0.014505743447781326
INFO: train/mse_loss: 0.014465472793432741
INFO: epoch: 3
INFO: iters: 70464
INFO: train/mse_loss: 0.014389766222024227
INFO: train/mse_loss: 0.013453237237986066
INFO: train/mse_loss: 0.013306563555842919
INFO: epoch: 4
INFO: iters: 93952
INFO: train/mse_loss: 0.012647044245178611
INFO: train/mse_loss: 0.012807737045385967
INFO: train/mse_loss: 0.011968838741840434
INFO: epoch: 5
INFO: iters: 117440
INFO: ------------------------------Validation Start------------------------------
INFO: val/mae: 0.3139403760433197
INFO: ------------------------------Validation End------------------------------
INFO: train/mse_loss: 0.011829124199711352
INFO: epoch: 6
INFO: iters: 140938
INFO: train/mse_loss: 0.010201521161369924
INFO: epoch: 7
INFO: iters: 164426
INFO: train/mse_loss: 0.010018873226117376
INFO: epoch: 8
INFO: iters: 187914
INFO: train/mse_loss: 0.009995935927926308
INFO: epoch: 9
INFO: iters: 211402
INFO: train/mse_loss: 0.009544536813287326
INFO: epoch: 10
INFO: iters: 234890
INFO: Saving the self at the end of epoch 10
INFO: ------------------------------Validation Start------------------------------
INFO: val/mae: 0.43820616602897644
INFO: ------------------------------Validation End------------------------------
```

AlanZhang1995 avatar Sep 07 '23 00:09 AlanZhang1995


No, the diffusion model's loss is the MSE between the sampled noise and the predicted noise; see https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models/issues/26#issue-1282232897. Besides, inference with diffusion models is quite stochastic, so this kind of behavior is normal.
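To make that point concrete: the training objective is the MSE between the sampled noise ε and the network's prediction ε̂ at a random timestep, not a pixel loss on the colorized image. A schematic numpy sketch with a stand-in model (the names and schedule here are illustrative, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # a standard linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def diffusion_mse_loss(x0, model, rng):
    """One training step's objective: corrupt x0 to x_t with Gaussian
    noise at a random timestep, then score the model's noise prediction."""
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)
    return np.mean((eps - eps_hat) ** 2)

# A perfect model that returned the true noise would score zero; this
# zero-predicting stand-in just measures the unit noise power (~1.0).
x0 = rng.standard_normal((3, 64, 64))
loss = diffusion_mse_loss(x0, lambda x_t, t: np.zeros_like(x_t), rng)
```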

```
23-09-03 04:09:21.974 - INFO: train/mse_loss: 0.004320403648868778
23-09-03 04:09:21.974 - INFO: epoch: 1423
23-09-03 04:09:21.974 - INFO: iters: 2072372
23-09-03 04:09:21.974 - INFO: Saving the self at the end of epoch 1423
23-09-03 04:09:23.265 - INFO: ------------------------------Validation Start------------------------------
23-09-03 04:20:12.848 - INFO: val/1-ssim: 0.1557578444480896
23-09-03 04:20:12.848 - INFO: ------------------------------Validation End------------------------------
23-09-03 04:23:16.320 - INFO: train/mse_loss: 0.004661682129078468
23-09-03 04:23:16.320 - INFO: epoch: 1424
23-09-03 04:23:16.320 - INFO: iters: 2073832
23-09-03 04:23:16.320 - INFO: Saving the self at the end of epoch 1424
23-09-03 04:23:17.622 - INFO: ------------------------------Validation Start------------------------------
23-09-03 04:34:06.690 - INFO: val/1-ssim: 0.10180902481079102
23-09-03 04:34:06.690 - INFO: ------------------------------Validation End------------------------------
23-09-03 04:37:05.177 - INFO: train/mse_loss: 0.004233014806692961
23-09-03 04:37:05.177 - INFO: epoch: 1425
23-09-03 04:37:05.177 - INFO: iters: 2075292
23-09-03 04:37:05.177 - INFO: Saving the self at the end of epoch 1425
23-09-03 04:37:06.475 - INFO: ------------------------------Validation Start------------------------------
23-09-03 04:47:56.020 - INFO: val/1-ssim: 0.1559600830078125
23-09-03 04:47:56.020 - INFO: ------------------------------Validation End------------------------------
23-09-03 04:50:55.078 - INFO: train/mse_loss: 0.004784488215476157
23-09-03 04:50:55.078 - INFO: epoch: 1426
23-09-03 04:50:55.078 - INFO: iters: 2076752
23-09-03 04:50:55.078 - INFO: Saving the self at the end of epoch 1426
23-09-03 04:50:56.547 - INFO: ------------------------------Validation Start------------------------------
23-09-03 05:01:45.988 - INFO: val/1-ssim: 0.06806707382202148
```

1228967342 avatar Sep 15 '23 05:09 1228967342

Hi, my problem currently seems to be overfitting. My dataset has 10k images; I trained on two 3090s for 12 hours, and then on val it only generates noise. The val loss also stays around 0.7.

yuanc3 avatar Sep 22 '23 09:09 yuanc3

For me, it happens as well. The training loss decreases very quickly and drops to 0.02 after 5 epochs, but the validation results are terrible. Does anyone have an idea?

TumVink avatar Sep 27 '23 14:09 TumVink

@yuanc3 @1228967342 @AlanZhang1995

TumVink avatar Sep 27 '23 14:09 TumVink

Hi, my problem currently seems to be overfitting. My dataset has 10k images; I trained on two 3090s for 12 hours, and then on val it only generates noise. The val loss also stays around 0.7.

Actually, I think it is normal that the val loss is much larger than the training loss, considering that the loss is calculated differently during inference than during training.
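To make the scale difference concrete: training logs the per-timestep noise MSE, while validation runs the full reverse process and scores the sampled image against the ground truth with MAE, so the two numbers are not comparable. A schematic numpy comparison (illustrative magnitudes only, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Training metric: MSE between unit-variance noise and its prediction.
# A reasonably good predictor with small residual error scores ~0.01.
eps = rng.standard_normal(n)
eps_hat = eps + 0.1 * rng.standard_normal(n)
train_mse = np.mean((eps - eps_hat) ** 2)

# Validation metric: MAE between a fully sampled image and the ground
# truth in [-1, 1]; even a decent sample lands at a few tenths.
gt = rng.uniform(-1, 1, n)
sample = np.clip(gt + 0.5 * rng.standard_normal(n), -1, 1)
val_mae = np.mean(np.abs(sample - gt))
```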

TumVink avatar Oct 10 '23 18:10 TumVink

My training results were poor at first too: with the default MSE loss, after 1200 epochs the colorized images had a severe color cast. After switching to a different loss function it got better, though I never hit the case where val only generates noise. Attached are some of the poor val images.

My training set has only 1500 images, but after changing the loss function the results on the test set are acceptable, so 10k images shouldn't overfit that easily. You could try running on the training images themselves; you may find the colorization is poor even on the training set. I suggest trying a different loss function; my training results are decent now.

1228967342 avatar Oct 21 '23 13:10 1228967342


Hey, glad to hear it! At least it proves the correctness of this repo. Would you mind sharing more details of the loss function? Is it an image-level loss function, for example a structural-similarity loss?

BW, Jingsong

TumVink avatar Oct 21 '23 20:10 TumVink

The hybrid loss function you mentioned is a mixture of the true variational lower bound and BCE.

Am I correct?



TumVink avatar Oct 23 '23 17:10 TumVink


No, it's just a very simple mixture
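The commenter doesn't say what the mixture is, so the following is only an illustrative guess at what a "very simple mixture" might look like: a weighted sum of MSE and L1 on the predicted noise (the form and weight are assumptions, not the commenter's actual loss):

```python
import numpy as np

def hybrid_loss(eps, eps_hat, l1_weight=0.5):
    """Simple mixed objective: MSE plus a weighted L1 term on the noise
    prediction error. Purely illustrative; not the commenter's loss."""
    mse = np.mean((eps - eps_hat) ** 2)
    l1 = np.mean(np.abs(eps - eps_hat))
    return mse + l1_weight * l1

eps = np.array([1.0, -1.0, 0.0])
eps_hat = np.array([0.5, -0.5, 0.0])
loss = hybrid_loss(eps, eps_hat)  # mse = 1/6, l1 = 1/3, total = 1/3
```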

1228967342 avatar Oct 24 '23 08:10 1228967342

@1228967342 Hi, I ran into the same problem. Could you share your hybrid loss design?

ludandandan avatar Dec 10 '23 01:12 ludandandan