
Training code for Adversarial Diffusion Distillation (ADD) not available?

Open Mohan2351999 opened this issue 1 year ago • 19 comments

I was not able to find the code for the ADD training mechanism, when will the code be released?

Mohan2351999 avatar Dec 08 '23 21:12 Mohan2351999

Looking forward to the release of the training code.

tnickMoxuan avatar Dec 11 '23 07:12 tnickMoxuan

Same question. Is the training code planned to be released soon?

m-muaz avatar Dec 12 '23 23:12 m-muaz

Actually, if you look at the ADD paper, they train StyleGAN-T++ for 2M iterations at batch size 2048 on 128 A100s. This suggests that the project had a budget that allows for ~100K USD experiments. So I highly doubt the ordinary person is going to be able to replicate their result, even with the training code available.

It is probably more appropriate to think of the ADD model as training an SD model almost from scratch. The problem it learns is much harder than LCM - they have to go from noise straight to a highly polished image.

LCM never manages to do that as the original training process of SD is not designed to do few-step denoising, so my hypothesis is that ADD has to learn a lot of new "concepts".

jon-chuang avatar Jan 09 '24 14:01 jon-chuang

@jon-chuang, thanks for your feedback. I tried to implement a training mechanism similar to what ADD is doing, but it seems to have a lot of instability in training, which doesn't yield good images. Looking at the paper, though, I think they mention training for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)?

Mohan2351999 avatar Jan 09 '24 16:01 Mohan2351999

@Mohan2351999, have you achieved good results with your ADD training? I've also tried training an ADD model, but the images generated after a short period of training looked terrible, like those from a failed GAN run.

fingerk28 avatar Jan 10 '24 01:01 fingerk28

@fingerk28 I was getting similar images, which turn into complete noise with longer training, probably due to instability. I still face the issue of NaNs in the discriminator's grad_norm during training. Please let me know if you find any success with your training. Thanks.
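One common guard against this (a minimal PyTorch sketch; `safe_step` and the threshold are my own choices, not anything from the ADD paper) is to clip the discriminator gradients and skip any optimizer step whose total gradient norm comes back non-finite, so a single NaN batch doesn't poison the weights:

```python
import torch

def safe_step(optimizer, model, max_norm=1.0):
    # Clip gradients; skip the update entirely if the norm is NaN/Inf.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if torch.isfinite(total_norm):
        optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return total_norm
```

Logging the returned `total_norm` also makes it easy to see when and how often the discriminator blows up.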

Mohan2351999 avatar Jan 10 '24 16:01 Mohan2351999

I think they mention training for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)?

Ok, you're right, colour me surprised. I expected Stability AI (and other major for-profit labs) to withhold details like that.

but it seems to have a lot of instability in training,

I have the same result (and others I've talked to have reported the same).

But GAN training is generally very hard to tune.

I still face the issue of 'nan' in the grad_norm of the discriminator while training.

I think in the ADD paper they mention using an R1 gradient penalty as regularization. I have yet to try this.
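For reference, R1 regularization (Mescheder et al., "Which Training Methods for GANs do actually Converge?") penalizes the squared gradient norm of the discriminator output with respect to *real* inputs. A minimal PyTorch sketch (the function name and the default γ are illustrative, not taken from the ADD paper):

```python
import torch

def r1_penalty(d_real, real_images, gamma=10.0):
    # R1 = (gamma / 2) * E[ ||grad_x D(x)||^2 ], computed on real samples only.
    grad, = torch.autograd.grad(
        outputs=d_real.sum(), inputs=real_images, create_graph=True
    )
    grad_norm2 = grad.pow(2).reshape(grad.shape[0], -1).sum(dim=1)
    return 0.5 * gamma * grad_norm2.mean()
```

Note that `real_images` must have `requires_grad_(True)` set before the discriminator forward pass, and the penalty is added to the discriminator loss (often only every few steps, for lazy regularization).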

jon-chuang avatar Jan 11 '24 11:01 jon-chuang

Btw @Mohan2351999 do shoot me an email at chuang dot jon at gmail dot com if you want to chat about this more offline. I'm quite determined to have this ADD training succeed.

jon-chuang avatar Jan 11 '24 11:01 jon-chuang

Hi @jon-chuang, thanks for your answers. I have already tried including the R1 gradient penalty, but still couldn't get rid of the NaNs in the discriminator's gradient norm.

Thanks for sharing your contact, I will send you an email soon.

Mohan2351999 avatar Jan 15 '24 16:01 Mohan2351999

@Mohan2351999 @jon-chuang I have also tried to reproduce ADD recently, and I have some doubts about the training data. Is it the LAION dataset? Does the quality of the training data have a significant impact on adversarial training?

YangPanHZAU avatar Jan 16 '24 08:01 YangPanHZAU

@jon-chuang @Mohan2351999 Hi, have you obtained good generation results? I used the training method of ADD, but the generated images have color issues, such as oversaturation...

Just like this: [attached sample image]

And I don't know what the problem is...

MqLeet avatar Jan 17 '24 11:01 MqLeet

Hey there! While the code for ADD is still unpublished, I started working on my own implementation. In a couple of weeks I will be able to train and test my model. For my tests I have trained my own (toy) UNet on the food101 dataset, and will then distill it.

Will be glad to receive any comments and pieces of advice on my work!

https://github.com/leffff/adversarial-diffusion-distillation/

leffff avatar Feb 13 '24 15:02 leffff

Hi, the paper says that the number of sampling steps for the teacher model is set to 1, which seems unreasonable to me. I tried ADD experiments with a DDPM trained on CIFAR-10: when the teacher samples with a single step, the result is an image of completely random noise. Or is their teacher model already good enough to generate high-quality images in a single step?
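For context, a "single teacher step" amounts to jumping straight to an x̂₀ estimate from the ε-prediction at noise level t, instead of iterating the full reverse chain. A sketch using the standard DDPM parameterization (the function name is mine):

```python
import torch

def one_step_x0(eps_model, x_t, t, alpha_bar):
    # DDPM x0-prediction: x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    abar = alpha_bar[t]
    eps = eps_model(x_t, t)
    return (x_t - (1.0 - abar).sqrt() * eps) / abar.sqrt()
```

At high noise levels ᾱ_t is tiny, so any error in the ε estimate gets amplified by 1/√ᾱ_t, which may be why a one-step sample from a standard DDPM teacher looks like noise.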

digbangbang avatar Feb 22 '24 08:02 digbangbang

You're right. The single-step teacher is quite useless. You can see this from Table 1 d) by comparing the first and second rows.

jonaskohler avatar Feb 22 '24 09:02 jonaskohler

[Two screenshots from the paper]

Here are screenshots from the paper, showing that they do only 1 teacher step. In my opinion this is unreasonable: we force the student to produce samples of the best possible quality in 4 steps instead of all the steps of the teacher, so the teacher should take more steps.

But imagine the teacher takes fewer steps than the student. Then the teacher's generation quality is worse than the student's. So why would we want the student's predictions to be as close as possible to the teacher's?

I do not understand this point yet.

In this video https://www.youtube.com/watch?v=ZxPQtXu1Wbw the author says that the teacher makes 1000 steps.

leffff avatar Feb 22 '24 09:02 leffff

@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you have any results from your reproduction yet? I used part of your code to try to reproduce the unconditional case on CIFAR-10; training may take longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

digbangbang avatar Feb 22 '24 09:02 digbangbang

@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you have any results from your reproduction yet? I used part of your code to try to reproduce the unconditional case on CIFAR-10; training may take longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

I will soon change my UNet and dataset and switch to either ImageNet or CIFAR-10! If I succeed, I will inform you! Waiting for your results :)

leffff avatar Feb 22 '24 10:02 leffff

Okay, I've figured out the answer.

The main contribution to the distillation is made by the discriminator, while the teacher is there to prevent overfitting, and this is the reason the teacher only does 1 step.
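To make that concrete, the objective pairs an adversarial term (driven by the discriminator) with a distillation term that pulls the one-step student toward the frozen teacher's reconstruction. This is a heavily simplified sketch: it omits the paper's noise-level sampling, hinge-loss details, and per-timestep weighting, and all names here are placeholders:

```python
import torch
import torch.nn.functional as F

def add_loss(student, teacher, discriminator, noise, lambda_distill=2.5):
    # One-step student generation from pure noise.
    x_student = student(noise)
    # Teacher target: no gradient flows into the (frozen) teacher.
    with torch.no_grad():
        x_teacher = teacher(x_student)
    adv = -discriminator(x_student).mean()       # non-saturating stand-in
    distill = F.mse_loss(x_student, x_teacher)   # pull student toward teacher
    return adv + lambda_distill * distill
```

Under this reading, the distillation term is a regularizer keeping the student on the teacher's manifold, while the adversarial term does the heavy lifting on sample sharpness.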

leffff avatar Feb 28 '24 09:02 leffff

@leffff Thanks for the explanation! Did you uncover any training hacks that were not mentioned in the paper? And are you getting good results for a single step?

jonaskohler avatar Feb 29 '24 21:02 jonaskohler