diffae
Train AutoEncoder Only
Hi, can we train the autoencoder only, keeping the DDIM fixed? I want to train an autoencoder on a feature vector of size 64x64x256 and expect to get a z_sem that works with the pretrained DDIM. The feature vector was generated from an image using a different U-Net architecture. It contains all the information of the original image, since we can easily map it back to the original image using the decoder of that U-Net model. Using the original image, I can get z_sem from the pretrained diffae autoencoder, which can be used as ground truth. Is there a way to train only the autoencoder with the feature vector and the ground-truth z_sem?
You mean you want to apply DiffAE not to RGB images but to a matrix of image features, e.g. VQ-VAE-like features? And what you expect to get from this is the meaningful z_sem that DiffAE provides? If so, it seems possible, and you don't need a ground-truth z_sem for it. You just need to train DiffAE on top of that image feature space instead of training it on the RGB space as usual. In this case, DiffAE learns to reconstruct the image features, i.e. 64x64x256, and at the same time learns to come up with a useful z_sem.
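For concreteness, here is a minimal sketch of what the data side of "training DiffAE on the feature space" could look like, assuming the 64x64x256 features are precomputed and saved to disk. The file layout and the `{'img', 'index'}` keys are assumptions meant to mimic the dicts the repo's datasets return, so check them against data.py before using anything like this:

```python
import torch
from torch.utils.data import Dataset

class FeatureMapDataset(Dataset):
    """Yields precomputed 256x64x64 U-Net feature maps instead of RGB images,
    so the diffusion model learns to denoise feature maps and the semantic
    encoder learns z_sem from them. Purely illustrative."""

    def __init__(self, feature_paths):
        self.feature_paths = feature_paths  # list of .pt files, one per image

    def __len__(self):
        return len(self.feature_paths)

    def __getitem__(self, idx):
        feat = torch.load(self.feature_paths[idx])      # tensor of shape (256, 64, 64)
        # roughly normalize to a range the diffusion model can handle,
        # analogous to scaling RGB images to [-1, 1]
        feat = (feat - feat.mean()) / (feat.std() + 1e-8)
        return {'img': feat, 'index': idx}
```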
Hi @phizaz, yes, I am looking for something similar. I need to reconstruct the image using the diffusion model. The conditioning, i.e. z_sem, should come from the image feature space, while the DDIM should work in RGB space. I have a few doubts:
- Do I have to train only the Diffusion Autoencoder, or the DDIM as well?
- If we train only the Diffusion Autoencoder, will the z_sem it generates be compatible with the pretrained DDIM?
- What about losses like LPIPS if we train in feature space instead of RGB?
- As per my understanding, Diffusion Autoencoder training uses the autoencoder as well as the diffusion model. So if we train a model using the 'ffhq128_autoenc_130M' config, it will use both the autoencoder and the diffusion model. Am I right?
A few terms need to be clarified first.
- You already have an autoencoder that provides the feature space on which everything else will be built.
- A diffusion autoencoder is itself a kind of DDIM, so it is not quite right to speak of DiffAE and DDIM as separate things. In any case, I don't think you need another DDIM besides a DiffAE.
- You mentioned a "pretrained DDIM", and I'm not sure what that refers to.
- Definitely, the word "autoencoder" in DiffAE is NOT the same as the autoencoder in the first point. You need to be careful with the words here.
Here is how I imagine it should look:
- You have a pretrained autoencoder that provides the image feature space.
- You train DiffAE on the image feature space with some loss function; I don't think you need LPIPS.
I don't think you need any more than these two components; a rough sketch of how they could fit together is below.
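To make the two-component picture concrete, here is a rough inference-time sketch. `feature_autoencoder` stands for your own pretrained model, `diffae` for a DiffAE trained on the feature space, and the `encode` / `encode_stochastic` / `render` calls are assumed to mirror the helpers used in this repo's notebooks; treat all names and signatures as assumptions:

```python
import torch

@torch.no_grad()
def reconstruct(rgb_image, feature_autoencoder, diffae, T=20):
    # 1) RGB -> feature space, using your own encoder
    feat = feature_autoencoder.encode(rgb_image)            # (1, 256, 64, 64)

    # 2) DiffAE operates entirely in feature space:
    #    semantic code + stochastic code -> reconstructed features
    z_sem = diffae.encode(feat)                             # (1, 512)
    x_T = diffae.encode_stochastic(feat, z_sem, T=250)      # feature-space noise map
    feat_rec = diffae.render(x_T, z_sem, T=T)

    # 3) feature space -> RGB, using your own decoder
    return feature_autoencoder.decode(feat_rec)
```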
- I have an autoencoder that maps an RGB image into a feature space.
- Thanks for clarifying.
- Pretrained DDIM: I was referring to the model trained with the 'ffhq128_ddpm_130M' config.
- Thanks for removing the confusion.
If I train DiffAE on the image feature space, then I will need the decoder from my original autoencoder to map the generated feature vector back to RGB space. Is there any way I can train only the semantic encoder of DiffAE, keeping the DDIM part fixed? That way, the semantic encoder would take the image features, generate z_sem (a 512-dimensional vector via model.encode()), and z_sem would then be used to manipulate the conditional DDIM model, which still works in RGB space.
Is there any way I can only train the semantic encoder of DIFFAE, keeping the DDIM part fixed?
I think you mean training only the semantic encoder while keeping the DDIM part fixed. Assuming we have a DiffAE pretrained on a potentially related dataset, it might be possible, but I'm not sure; I haven't run any experiments on this.
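If someone wanted to try it, one possible setup is sketched below: freeze the conditional DDIM's UNet and train a new semantic encoder on the 64x64x256 features with the usual noise-prediction loss. Note that `frozen_unet(x_t, t, cond=...)`, `diffusion.q_sample`, `diffusion.num_timesteps`, and the batch keys are illustrative assumptions rather than the repo's exact API, and there is no guarantee this converges to a z_sem the frozen DDIM can use:

```python
import torch
import torch.nn.functional as F

def train_encoder_only(encoder, frozen_unet, diffusion, loader, epochs=1, lr=1e-4):
    """Illustrative sketch: train a new semantic encoder against a frozen
    conditional DDIM. `encoder` maps (B, 256, 64, 64) features to a (B, 512)
    z_sem; `frozen_unet(x_t, t, cond=z_sem)` predicts the added noise."""
    for p in frozen_unet.parameters():
        p.requires_grad_(False)                     # keep the DDIM fixed

    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            x0 = batch['img']                       # RGB image the DDIM was trained on
            feat = batch['feat']                    # 64x64x256 features of the same image

            z_sem = encoder(feat)                   # (B, 512)
            t = torch.randint(0, diffusion.num_timesteps, (x0.size(0),), device=x0.device)
            noise = torch.randn_like(x0)
            x_t = diffusion.q_sample(x0, t, noise)  # corrupt the RGB image

            loss = F.mse_loss(frozen_unet(x_t, t, cond=z_sem), noise)
            opt.zero_grad()
            loss.backward()                         # gradients reach only the encoder
            opt.step()
```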
Dear Authors,
Thanks for sharing this work. I have a question about how the Semantic Encoder (shown in Figure 2) is trained. I cannot find a related loss for the training of the Semantic Encoder. The paper only shows Eqn (6) and (9), but these two losses are used to train the "conditional DDIM" and "latent DDIM," not for the "Semantic Encoder."
The semantic encoder is trained end-to-end, which means the training signal propagates from the reconstruction loss, through the diffusion model's UNet, and arrives at the semantic encoder. The encoder is therefore encouraged to encode information that helps denoise the whole image, while only a corrupted version of the image is available to the UNet at that time.
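In pseudocode, one simplified training step looks roughly like this (epsilon-prediction objective only; `unet(x_t, t, cond=z_sem)` and `diffusion.q_sample` are illustrative names, not the repo's exact API):

```python
import torch
import torch.nn.functional as F

def diffae_train_step(x0, encoder, unet, diffusion, optimizer):
    """Single simplified DiffAE step: the only loss is the noise-prediction
    (reconstruction) loss, and backpropagating it through the UNet also
    updates the semantic encoder, so z_sem needs no loss of its own."""
    z_sem = encoder(x0)                                    # clean image -> semantic code
    t = torch.randint(0, diffusion.num_timesteps, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = diffusion.q_sample(x0, t, noise)                 # corrupt the image

    eps_pred = unet(x_t, t, cond=z_sem)                    # UNet sees only x_t, t, z_sem
    loss = F.mse_loss(eps_pred, noise)                     # Eqn (6)-style objective

    optimizer.zero_grad()
    loss.backward()                                        # gradient flows UNet -> encoder
    optimizer.step()
    return loss
```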
I see. Thanks for the reply.
Then, in this case, how can we ensure the code z_{sem} has two separate parts, one for linear semantics and the other for stochastic details (as mentioned in the Abstract)? There seems to be no explicit regularization to encourage this kind of disentanglement. Any insights on this?
z_sem should only encode semantic information, leaving the stochastic part to be the job of x_T.
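One way to see this split in practice is to hold z_sem fixed and resample x_T: the semantics stay put and only the stochastic details change. The `encode` / `render` helpers below follow the repo's notebooks, but treat the exact signatures as assumptions:

```python
import torch

@torch.no_grad()
def vary_stochastic_part(model, img, n=4, T=20):
    """Keep z_sem fixed and resample x_T: identity, pose, etc. stay the same,
    while low-level stochastic details (e.g. hair strands) vary."""
    z_sem = model.encode(img)                    # semantic code, fixed
    outs = []
    for _ in range(n):
        x_T = torch.randn_like(img)              # fresh stochastic code
        outs.append(model.render(x_T, z_sem, T=T))
    return torch.cat(outs)
```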