
How long does it take to train?

Open qsh-zh opened this issue 3 years ago • 24 comments

Thanks for sharing this clean implementation.

I tried it on the CelebA dataset. After 150k steps, the generated images are not as good as those claimed in the paper or the flowers shown in the README.

Is it something to do with the dataset, or do I need more training time?

image

qsh-zh avatar Feb 24 '21 09:02 qsh-zh

Hi, I'm also trying to train with this repo. What image resolution are you using? In the paper (Appendix B) they say they trained 256x256 CelebA-HQ for 500k steps with batch size 64. Did your loss plateau, or is it still decreasing? And by the way, how much time did it take to train those 150k steps, and with what batch size?

ariel415el avatar May 19 '21 07:05 ariel415el

Similar results after 145k steps on CIFAR. I wonder if it is harder to train than a GAN, or whether it is just not stable enough yet...

IceClear avatar May 29 '21 14:05 IceClear

@ariel415el The loss plateaued for the figure I showed, if my memory serves me well. I forget some details of the experiment, but it ran for about 36-48 hours on one 2080 Ti. Batch size was 32 with fp16, U-Net dim 64.

qsh-zh avatar May 29 '21 17:05 qsh-zh

@IceClear Do you mind sharing sample images if you could?

qsh-zh avatar May 29 '21 17:05 qsh-zh

@IceClear Do you mind sharing sample images if you could?

Sure, here it is after 186k steps (sample 186).

IceClear avatar May 30 '21 11:05 IceClear

I've been training with this repo and am getting (very) good results on 256x256 images after around 800,000 global steps (batch size 16). Score-based models are known to take more compute to train than a comparable GAN, so perhaps more training time is required in your cases?

Smith42 avatar Jun 12 '21 08:06 Smith42

Thanks @Smith42. The thing is, for me and @qshzh the training loss plateaus, so I'm not sure how more steps would help. Did your loss continue decreasing throughout training? Can you share some of your result images here so that we know what to expect? By the way, for how long did you train the model? I guess it was more than 2 days.

ariel415el avatar Jun 13 '21 07:06 ariel415el

Can you share some of your result images here so that we know what to expect?

@ariel415el unfortunately I can't share the results just yet, but should have a preprint out soon that I can share.

The thing is, for me and @qshzh the training loss plateaus, so I'm not sure how more steps would help. Did your loss continue decreasing throughout training?

The loss didn't seem to plateau for me until very late in the training cycle, but this is with training on a dataset with order 10^6 examples.

By the way, for how long did you train the model? I guess it was more than 2 days.

On a single V100 it took around 2 weeks of training.

Smith42 avatar Jun 14 '21 08:06 Smith42
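For scale, a back-of-the-envelope calculation from the numbers above (assuming the ~800k global steps took the full two weeks on that single V100):

```python
# Rough training throughput implied by the figures in this thread
# (assumption: ~800,000 global steps over ~14 days on one V100, batch size 16).
steps, days = 800_000, 14
steps_per_sec = steps / (days * 24 * 3600)
print(f"{steps_per_sec:.2f} steps/s")  # ~0.66 steps/s
```

So a run of a few days covers only a small fraction of that schedule, which is consistent with the undertrained samples earlier in the thread.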

@IceClear @ariel415el This is the FID curve on CIFAR-10 for 1k sampled images. image Step 26 in the figure corresponds to global step 108,000. For 50k samples, the FID is 15.13.

qsh-zh avatar Jun 14 '21 15:06 qsh-zh

The image size is 256, batch size is 32, after 480k steps; the results do not look good. image

Sumching avatar Jun 20 '21 05:06 Sumching

@Sumching, @qshzh, @IceClear @ariel415el, @Smith42 Guys, how low are your training losses? In my case, the noise-prediction losses are in the several hundreds to thousands. Is this right?

gwang-kim avatar Sep 08 '21 11:09 gwang-kim

@Sumching, @qshzh, @IceClear @ariel415el, @Smith42 Guys, how low are your training losses? In my case, the noise-prediction losses are in the several hundreds to thousands. Is this right?

That's way too high; I'm getting below 0.1 once fully trained. Have you checked your normalisations?

Smith42 avatar Sep 23 '21 10:09 Smith42
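A quick sanity check on loss scale (a sketch, not from the thread): the noise-prediction objective regresses against unit-variance Gaussian noise, so even a model that constantly outputs zero achieves an L1 loss of E|eps| = sqrt(2/pi) ≈ 0.8. Losses in the hundreds therefore point to a scaling problem, such as images not being normalised to [-1, 1], rather than a hard learning problem:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)  # the noise an eps-prediction model regresses

# L1 loss of a trivial all-zeros prediction: E|eps| = sqrt(2/pi) ~ 0.798.
baseline = np.abs(eps).mean()
print(round(baseline, 2))  # ~0.8

# Any trained model should beat this baseline; "sub 0.1" is consistent with
# that, while losses of hundreds suggest inputs/targets on the wrong scale.
```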

@Smith42 hi, I trained it on CIFAR-10. The batch size is 16 and the image size is 128. The loss is about 0.05, but the generated images seem blurred.

jiangxiluning avatar Feb 21 '22 01:02 jiangxiluning

@Smith42 hi, I trained it on CIFAR-10. The batch size is 16 and the image size is 128. The loss is about 0.05, but the generated images seem blurred.

I use a fork of Phil's code in my paper and am not getting blurring problems. Maybe there is something up with your hyperparameters?

Smith42 avatar Mar 08 '22 16:03 Smith42

Hi @Smith42 & @jiangxiluning, when you say you get a loss below 0.1, are you using an L1 or L2 loss?

cajoek avatar Mar 18 '22 10:03 cajoek

@cajoek for me, it is L1.

jiangxiluning avatar Mar 19 '22 09:03 jiangxiluning

L1 for me too

Smith42 avatar Mar 22 '22 14:03 Smith42

Thanks @jiangxiluning @Smith42!

My loss unfortunately plateaus at about 0.10-0.15, so I decided to plot the mean L1 loss over one epoch versus the timestep t, and I noticed that the loss stays quite high for low values of t, as can be seen in this figure. Do you know if that is expected? Loss_vs_timestep (L1 loss vs timestep t after many epochs on a small dataset; convergence is not quite reached yet)

cajoek avatar Mar 22 '22 16:03 cajoek
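One way to reproduce that per-timestep curve is to log (t, loss) pairs during training and average them in timestep buckets. A sketch (the helper name, bin count, and toy data are my own, not from the repo):

```python
import numpy as np

def loss_per_timestep(timesteps, losses, num_timesteps=1000, num_bins=20):
    """Mean loss per timestep bucket, for plotting loss vs t."""
    bins = np.minimum(timesteps * num_bins // num_timesteps, num_bins - 1)
    sums = np.bincount(bins, weights=losses, minlength=num_bins)
    counts = np.bincount(bins, minlength=num_bins)
    return sums / np.maximum(counts, 1)

# Toy logged pairs: losses tend to stay higher at low t (nearly clean images),
# matching the shape of the curve described above.
t = np.array([10, 15, 500, 510, 990])
l1 = np.array([0.9, 0.7, 0.1, 0.1, 0.05])
curve = loss_per_timestep(t, l1)
print(np.round(curve[[0, 10, 19]], 3))
```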

@Smith42 Would you be able to show some samples/results from training your CelebA model? It seems that a lot of other people are struggling to reproduce the results shown in the paper.

malekinho8 avatar Jun 03 '22 20:06 malekinho8

@Smith42 Would you be able to show some samples/results from training your CelebA model? It seems that a lot of other people are struggling to reproduce the results shown in the paper.

@malekinho8 I ran a fork of lucidrains' model on a large galaxy image data set here, not on CelebA. However, the galaxy imagery is well replicated with this codebase, so I expect it will work okay on CelebA too.

Smith42 avatar Jun 05 '22 08:06 Smith42

@jiangxiluning Can you please share your code? I am also training on CIFAR-10 and the loss does not go below 0.7. Below are my trainer and model:

```python
trainer = Trainer(
    diffusion,
    new_train,
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 500000,       # total training steps
    gradient_accumulate_every = 2,  # gradient accumulation steps
    ema_decay = 0.995,              # exponential moving average decay
    amp = True                      # turn on mixed precision
)

model = Unet(
    dim = 16,
    dim_mults = (1, 2, 4)
)
```

DushyantSahoo avatar Jul 14 '22 19:07 DushyantSahoo

Hi, I got the same problem on CIFAR-10. The model generated failed images even after 150k steps. Did you succeed?

@jiangxiluning Can you please share your code? I am also training on CIFAR-10 and the loss does not go below 0.7. Below are my trainer and model:

```python
trainer = Trainer(
    diffusion,
    new_train,
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 500000,       # total training steps
    gradient_accumulate_every = 2,  # gradient accumulation steps
    ema_decay = 0.995,              # exponential moving average decay
    amp = True                      # turn on mixed precision
)

model = Unet(
    dim = 16,
    dim_mults = (1, 2, 4)
)
```

greens007 avatar Aug 16 '22 07:08 greens007

Hi: CIFAR-10 contains tiny 32x32 pictures, so they will naturally look blurry if you resize them to 128x128.

yiyixuxu avatar Sep 19 '22 21:09 yiyixuxu
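To see why upscaling alone reads as blur, here is a small 1-D sketch (pure NumPy; the sizes mirror the 32 → 128 case): a one-pixel-sharp edge at 32-sample resolution becomes a multi-pixel ramp after linear interpolation, and no detail is added.

```python
import numpy as np

low = np.zeros(32)
low[16:] = 1.0                              # a one-pixel-sharp edge at 32-px scale
x_hi = np.linspace(0, 31, 128)
high = np.interp(x_hi, np.arange(32), low)  # linear upsample to 128 px

# Count transitional pixels (neither ~0 nor ~1) around the edge:
print(((low > 0.05) & (low < 0.95)).sum())    # 0 before upsampling
print(((high > 0.05) & (high < 0.95)).sum())  # 4 after: the edge smears
```

The same thing happens in 2-D with bilinear resizing, so a model trained on upscaled CIFAR-10 is learning to reproduce already-soft edges.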

Thanks for sharing this clean implementation.

I tried it on the CelebA dataset. After 150k steps, the generated images are not as good as those claimed in the paper or the flowers shown in the README.

Is it something to do with the dataset, or do I need more training time?

image

Excuse me, did you modify the code or parameters during training, or load a pre-trained weight file? The loss drops to NaN during my training.

177488ZL avatar Feb 28 '23 13:02 177488ZL