
autoencoder for LDM

Open seung-kim opened this issue 3 years ago • 7 comments

Hi! Could you add a table showing which autoencoder (first-stage) models correspond to which LDMs, please? Maybe I am missing this information somewhere, but it isn't clear which one goes with which.

seung-kim avatar Jan 17 '22 17:01 seung-kim

@seung-kim I was struggling with this too. I ran scripts/download_first_stages.sh, which downloaded all the autoencoders. Each autoencoder ships with a config.yaml file that lists the training data as ldm.data.openimages.FullOpenImagesTrain, so it seems they were all trained on the OpenImages dataset?

@ablattmann @rromb could you please confirm this and also add the information to the README?
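
In case it helps others, here is a minimal sketch for scanning the downloaded configs and printing their dataset targets. It assumes the default models/first_stage_models/ layout produced by the download script; adjust the glob if your layout differs.

```python
# Minimal sketch: list the data targets in the downloaded first-stage configs.
# Assumes PyYAML is installed and that scripts/download_first_stages.sh
# unpacked the models under models/first_stage_models/<name>/.
import glob
import yaml

for path in sorted(glob.glob("models/first_stage_models/*/config.yaml")):
    with open(path) as f:
        cfg = yaml.safe_load(f)

    print(path)
    data_cfg = cfg.get("data") or {}
    print("  data module:", data_cfg.get("target"))
    # Each split entry names its dataset class,
    # e.g. ldm.data.openimages.FullOpenImagesTrain
    for split, split_cfg in (data_cfg.get("params") or {}).items():
        if isinstance(split_cfg, dict) and "target" in split_cfg:
            print(f"  {split}: {split_cfg['target']}")
```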

vvvm23 avatar Mar 13 '22 11:03 vvvm23


I have the same question. Have you figured it out yet?

Eudea avatar Oct 20 '22 01:10 Eudea

The class FullOpenImagesTrain does not exist in this repository; it looks like the file ldm/data/openimages.py is missing. Could you check that, @rromb @ablattmann?
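
In the meantime, here is a hypothetical stand-in, guessed from how the other ldm.data datasets behave (each item is a dict with an "image" array in [-1, 1]). It is not the authors' original implementation, and data_root is a placeholder.

```python
# Hypothetical replacement for the missing ldm/data/openimages.py.
# The interface mirrors the other ldm.data datasets; the original
# preprocessing used by the authors is unknown.
import glob
import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class FullOpenImagesTrain(Dataset):
    def __init__(self, data_root="data/openimages/train", size=256):
        # data_root is a placeholder; point it at your local OpenImages copy.
        self.paths = sorted(
            glob.glob(os.path.join(data_root, "**", "*.jpg"), recursive=True)
        )
        self.size = size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        # Center-crop to a square, then resize; swap in whatever
        # augmentation you actually want to train with.
        s = min(img.size)
        left = (img.width - s) // 2
        top = (img.height - s) // 2
        img = img.crop((left, top, left + s, top + s))
        img = img.resize((self.size, self.size), Image.BICUBIC)
        arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
        return {"image": arr}
```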

keyu-tian avatar Sep 03 '23 14:09 keyu-tian

Hi, did anyone figure this out?

mia01 avatar Sep 13 '23 22:09 mia01

@mia01 @Eudea @vvvm23 @seung-kim I think I'm training VQ-VAEs well on OpenImages, just with a random-crop augmentation (resize to 384, then random crop to 256) and normalizing pixels from [0, 1] to [-1, 1]. For fine-tuning I use lr=4e-4, batch_size=1024; for training from scratch I use lr=4e-6, batch_size=1024. I use the Adam optimizer with betas=(0.5, 0.9), following https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/taming/models/vqgan.py#L128.
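
Roughly, the setup looks like the sketch below (the model is a placeholder module, and "resize to 384" is read here as resizing the shorter side; the lr/betas just repeat the numbers above, not an official recipe).

```python
# Sketch of the augmentation and optimizer settings described above.
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(384),                      # shorter side -> 384
    transforms.RandomCrop(256),                  # random 256x256 crop
    transforms.ToTensor(),                       # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # [0, 1] -> [-1, 1]
])

model = torch.nn.Conv2d(3, 3, 1)  # placeholder; substitute the actual VQ-VAE / VQGAN

# betas=(0.5, 0.9) as in taming-transformers' VQGAN configure_optimizers;
# lr=4e-4 for fine-tuning, lr=4e-6 for from-scratch, per the numbers above.
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.5, 0.9))
```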

keyu-tian avatar Oct 08 '23 07:10 keyu-tian


Hi @keyu-tian, I am curious about the distribution of image short-side lengths in OpenImages. The VAE is trained with the augmentation you describe (resize to 384, then random crop to 256); does that mean all images are downsampled to 384?
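
For context, this is the quick sketch I would use to check that distribution on a local copy (the directory path is a placeholder). Note that if the resize targets the shorter side, images with a short side below 384 would actually be upsampled rather than downsampled.

```python
# Histogram of image short-side lengths for a local OpenImages copy.
# The directory path is a placeholder; adjust it to your layout.
import glob
from collections import Counter

from PIL import Image

buckets = Counter()
for path in glob.glob("data/openimages/train/**/*.jpg", recursive=True):
    with Image.open(path) as img:          # reading the header is enough for .size
        short_side = min(img.size)
    buckets[short_side // 128 * 128] += 1  # 128-px-wide bins

for lo in sorted(buckets):
    print(f"{lo:5d}-{lo + 127:5d} px: {buckets[lo]}")
```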

wtliao avatar Feb 07 '24 00:02 wtliao


Hi @keyu-tian, I'm curious whether you've done any experiments with a VAE instead of a VQGAN? I get the impression the grid artifacts are hard to eliminate; should the discriminator loss weight be increased?
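
For reference, the knob I have in mind is disc_weight in the autoencoder's lossconfig. A sketch of how I would override it is below; the config path and key names assume the layout of this repo's configs/autoencoder/*.yaml files, and 0.75 is just an example value, not a recommendation.

```python
# Sketch: raising the discriminator loss weight via the training config.
# Assumes the lossconfig layout of this repo's KL autoencoder configs
# (LPIPSWithDiscriminator with a disc_weight parameter); adjust if yours differs.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/autoencoder/autoencoder_kl_32x32x4.yaml")
loss_params = cfg.model.params.lossconfig.params
print("current disc_weight:", loss_params.disc_weight)

loss_params.disc_weight = 0.75  # example value only
OmegaConf.save(cfg, "configs/autoencoder/autoencoder_kl_32x32x4_hidisc.yaml")
```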

bu135 avatar Apr 22 '24 13:04 bu135