generative-models
Training code for Adversarial Diffusion Distillation (ADD) not available?
I was not able to find the code for the ADD training mechanism. When will the code be released?
Looking forward to the training code being published.
Same question. Is the training code planned to be released soon?
Actually, if you look at the ADD paper, they train StyleGAN-T++ for 2M iterations at batch size 2048 on 128 A100s. This suggests that the project had a budget that allows for ~100K USD experiments. So I highly doubt the ordinary person is going to be able to replicate their result, even with the training code available.
It is probably more appropriate to think of the ADD model as training an SD model almost from scratch. The problem it learns is much harder than LCM - they have to go from noise straight to a highly polished image.
LCM never manages to do that as the original training process of SD is not designed to do few-step denoising, so my hypothesis is that ADD has to learn a lot of new "concepts".
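For context, the ADD objective described in the paper combines an adversarial loss (a discriminator judging the student's one-step output) with a distillation loss against the frozen teacher. Here is a minimal sketch of what that might look like, assuming eps-prediction networks and DDPM-style `alphas_cumprod`; all helper names are hypothetical, since the official training code is unreleased:

```python
import torch
import torch.nn.functional as F

def add_student_loss(student, teacher, discriminator, noise, t_student, t_teacher,
                     alphas_cumprod, lambda_distill=2.5):
    """Sketch of an ADD-style student objective (hypothetical helper):
    one-step generation from noise, judged by a discriminator, plus a
    distillation term pulling the student toward the teacher's denoising."""
    # Student maps (nearly pure) noise to an image estimate in a single step
    a_s = alphas_cumprod[t_student].view(-1, 1, 1, 1)
    eps_s = student(noise, t_student)
    x0_student = (noise - (1 - a_s).sqrt() * eps_s) / a_s.sqrt()

    # Adversarial term: the discriminator should score the student's output as real
    adv_loss = -discriminator(x0_student).mean()

    # Distillation term: re-noise the student's output, let the teacher denoise
    # it in one step, and pull the student toward that (stop-gradient) target
    a_t = alphas_cumprod[t_teacher].view(-1, 1, 1, 1)
    noise2 = torch.randn_like(x0_student)
    x_t = a_t.sqrt() * x0_student + (1 - a_t).sqrt() * noise2
    with torch.no_grad():
        eps_t = teacher(x_t, t_teacher)
        x0_teacher = (x_t - (1 - a_t).sqrt() * eps_t) / a_t.sqrt()
    distill_loss = F.mse_loss(x0_student, x0_teacher)

    return adv_loss + lambda_distill * distill_loss
```

This is only a sketch of the objective's shape, not a claim about the exact weighting or discriminator architecture StabilityAI used.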
@jon-chuang, thanks for your feedback. I tried to implement a training mechanism similar to what ADD is doing, but it seems to have a lot of instability in the training, which doesn't yield good images. Looking at the paper, though, I think they mention that they train for 4k iterations on batch size 128, right? (details in ADD paper, page 6, Table 1)
@Mohan2351999, have you achieved good results with your ADD training? I've also tried training an ADD model, but the images generated after a bit of training looked terrible, like those from a failed GAN run.
@fingerk28 I was getting similar images, which turn into complete noise with longer training, probably due to training instability. I still face the issue of 'nan' in the discriminator's grad_norm while training. Please let me know if you find any success with your training. Thanks.
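For anyone else hitting the same 'nan' grad_norm: one pragmatic workaround (a hypothetical helper, not something from the paper) is to compute the discriminator's gradient norm every step, skip the update entirely when it is non-finite, and clip it otherwise:

```python
import math
import torch

def safe_disc_step(optimizer, discriminator, loss, max_norm=1.0):
    """Backprop the discriminator loss, but skip the optimizer update if the
    total gradient norm is NaN/Inf; otherwise clip it to max_norm.
    Returns True if an update was applied. (Hypothetical debugging helper.)"""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(discriminator.parameters(), max_norm)
    if not math.isfinite(grad_norm.item()):
        optimizer.zero_grad(set_to_none=True)  # drop this step entirely
        return False
    optimizer.step()
    return True
```

This doesn't cure the underlying instability, but it keeps one bad batch from poisoning the discriminator weights while you investigate.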
I think they mention that they train for 4k iterations on batch size 128, right? (details in ADD paper, page 6, Table 1)
Ok, you're right, colour me surprised. I expected Stability AI (and all major for-profit labs) to withhold details like that.
but it seems to have a lot of instability in the training,
I have the same result (and others I've talked to have reported the same).
But GAN training is generally very hard to tune.
I still face the issue of 'nan' in the grad_norm of the discriminator while training.
I think in the ADD paper they mention using R1 gradient penalty as regularization. I have yet to try this.
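For reference, the R1 penalty (from Mescheder et al., "Which Training Methods for GANs do actually Converge?") regularizes the discriminator's gradient on real samples only. A minimal sketch; the `gamma` value is an assumption, not a number from the ADD paper:

```python
import torch

def r1_penalty(discriminator, real_images, gamma=10.0):
    """R1 gradient penalty: penalize the squared norm of the discriminator's
    gradient w.r.t. real inputs. Added to the discriminator loss only."""
    real_images = real_images.detach().requires_grad_(True)
    logits = discriminator(real_images)
    # Per-sample gradients of the logits w.r.t. the input images;
    # create_graph=True so the penalty itself is differentiable
    (grads,) = torch.autograd.grad(
        outputs=logits.sum(), inputs=real_images, create_graph=True
    )
    penalty = grads.pow(2).flatten(1).sum(dim=1).mean()
    return 0.5 * gamma * penalty
```

In practice people often apply it lazily (every ~16 discriminator steps, with gamma scaled up accordingly) to save compute, as in the StyleGAN2 codebase.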
Btw @Mohan2351999 do shoot me an email at chuang dot jon at gmail dot com
if you want to chat about this more offline. I'm quite determined to make this ADD training succeed.
Hi @jon-chuang, thanks for your answers. I have already tried including the R1 gradient penalty, but still couldn't get rid of the "nan" in the discriminator's gradient norm.
Thanks for sharing your contact, I will send you an email soon.
@Mohan2351999 @jon-chuang I have also tried to reproduce ADD recently, and I have some doubts about the training data. Is it the LAION dataset? Does the quality of the training data have a significant impact on the adversarial training?
@jon-chuang @Mohan2351999 Hi, have you obtained good generation results? I used the training method of ADD, but the generated images have color issues, such as oversaturation...
Just like this
And I don't know what the problem is...
Hey there! While the code for ADD is still unpublished, I started working on my own implementation. In a couple of weeks I will be able to train and test my model. For my tests I have trained my own (toy) UNet on the food101 dataset, and will further distill it.
Will be glad to receive any comments and pieces of advice on my work!
https://github.com/leffff/adversarial-diffusion-distillation/
Hi, the paper says that the step count of the teacher model is set to 1, which I think is unreasonable. I tried to conduct ADD experiments with a DDPM trained on CIFAR10: when the teacher samples with a single step, the result is a picture of completely random noise. Or is their teacher model already good enough to generate high-quality images in one step?
You're right, the single-step teacher is quite useless. You can see this from Table 1 d) by comparing the first and second rows.
Here are screenshots from the paper, showing they do only 1 teacher step. In my opinion this is unreasonable: we force the student to produce samples of the best possible quality in 4 steps instead of all the teacher's steps, which suggests the teacher should take more steps.
But imagine the teacher takes fewer steps than the student. That means the teacher's generation quality is worse than the student's, so why would we want the student's predictions to be as close as possible to the teacher's?
I do not understand this part yet.
In this video https://www.youtube.com/watch?v=ZxPQtXu1Wbw the author says that the teacher takes 1000 steps.
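One point worth making explicit: a single teacher "step" in ADD is not a full generation from pure noise; it is the x0 estimate implied by the teacher's eps prediction at whatever noise level the (re-noised) student sample sits at. A sketch under standard DDPM notation:

```python
import torch

def one_step_x0(eps_model, x_t, t, alphas_cumprod):
    """Single-step denoising: recover the x0 estimate implied by the model's
    eps prediction at noise level t, via the standard DDPM identity
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = eps_model(x_t, t)
    return (x_t - (1 - a).sqrt() * eps) / a.sqrt()
```

At low and medium t this is a reasonable reconstruction of the underlying image; only when x_t is essentially pure noise (t near T) does it degenerate into blur or noise, which would match what you observed when sampling the teacher from scratch with one step on CIFAR10.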
@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results on code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10. The training time may be longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!
I will soon change my UNet and dataset and switch to either ImageNet or CIFAR10! If I succeed, I will inform you! Waiting for your results :)
Okay, I've figured out the answer.
The main contribution to distillation is made by the discriminator, while the teacher is there to prevent overfitting, which is why the teacher only does one step.
@leffff Thanks for the explanation! Did you uncover any training hacks that were not mentioned in the paper? And are you getting good results for a single step?