latent-diffusion
What are `progressive_row` & `diffusion_row` in training?
I've successfully trained latent diffusion on the AFHQ dataset, but I'm having a hard time interpreting the results. During training it produces these images in the log directory:
- `diffusion_row`: Is this the forward process of the diffusion model?
- `mask`: I'm training an unconditional LDM without any inpainting, so why is it being produced here?
- `progressive_row`: What is a progressive row? Is this the reverse diffusion process?
- `inputs`:
- `reconstruction`:
- `samples`:
Additionally, the images are being saved with a naming convention of `inputs_gs-045000_e-000087_b-000108.png`. Could you please clarify the meaning of "gs" in this context?
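For reference, those tags appear to come from `main.ImageLogger`, which formats filenames from the trainer's state: `gs` is the global step, `e` the epoch, and `b` the batch index. A small parser for that naming scheme (the function name and regex are mine, not from the repo):

```python
import re

def parse_log_name(filename):
    """Split an ImageLogger-style filename such as
    'inputs_gs-045000_e-000087_b-000108.png' into its parts.
    Assumed meanings: gs = global step, e = epoch, b = batch index."""
    m = re.match(
        r"(?P<key>\w+)_gs-(?P<gs>\d+)_e-(?P<e>\d+)_b-(?P<b>\d+)\.png",
        filename,
    )
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    return {
        "key": m["key"],            # which logged image, e.g. 'inputs'
        "global_step": int(m["gs"]),
        "epoch": int(m["e"]),
        "batch": int(m["b"]),
    }
```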
My yaml files for training AFHQ dataset are:
Autoencoder YAML
```yaml
model:
  base_learning_rate: 4.5e-06
  target: taming.models.vqgan.VQModel
  params:
    embed_dim: 3
    n_embed: 1024
    monitor: val/rec_loss
    ddconfig:
      double_z: false
      z_channels: 3
      resolution: 128
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [1,2,4]
      num_res_blocks: 2
      attn_resolutions: []
      dropout: 0.0
    lossconfig:
      target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
      params:
        disc_conditional: false
        disc_in_channels: 3
        disc_start: 0
        disc_weight: 0.75
        codebook_weight: 1.0
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 20
    num_workers: 16
    wrap: true
    train:
      target: ldm.data.afhq.AFHQCatTrain
      params:
        size: 128
        # crop_size: 128
    validation:
      target: ldm.data.afhq.AFHQCatValidation
      params:
        size: 128
        # crop_size: 128
```
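As a sanity check on the config above: each entry in `ch_mult` after the first adds one 2× downsample in the VQGAN encoder, so `ch_mult: [1,2,4]` gives a downsampling factor f = 4 and maps 128×128 images to 32×32 latents, matching `image_size: 32` in the diffusion YAML. A quick sketch (the helper name is mine):

```python
def latent_resolution(resolution, ch_mult):
    """The VQGAN downsamples once per level after the first,
    so the total downsampling factor is 2**(len(ch_mult) - 1)."""
    f = 2 ** (len(ch_mult) - 1)
    return resolution // f

# resolution=128 with ch_mult=[1,2,4] -> f=4 -> 32x32 latents
```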
Latent Diffusion YAML
```yaml
model:
  base_learning_rate: 2.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0195
    num_timesteps_cond: 1
    log_every_t: 100
    timesteps: 1000
    first_stage_key: image
    image_size: 32
    channels: 3
    monitor: val/loss_simple_ema
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        in_channels: 3
        out_channels: 3
        model_channels: 224
        attention_resolutions:
        # note: these aren't actually resolutions but
        # downsampling factors; for the 32x32 latents here,
        # this corresponds to attention on spatial
        # resolutions 16, 8, and 4
        - 8
        - 4
        - 2
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 3
        - 4
        num_head_channels: 32
    first_stage_config:
      # target: taming.models.vqgan.VQModel
      target: ldm.models.autoencoder.VQModelInterface
      params:
        embed_dim: 3
        n_embed: 1024
        ckpt_path: models/first_stage_models/afhq-cat-vq/model.ckpt
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 128
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1,2,4 ]
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
          params:
            disc_conditional: False
            disc_in_channels: 3
            disc_start: 10000
            disc_weight: 0.5
            codebook_weight: 1.0
    cond_stage_config: __is_unconditional__
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 10
    num_workers: 5
    wrap: true
    train:
      target: ldm.data.afhq.AFHQCatTrain
      params:
        size: 128
    validation:
      target: ldm.data.afhq.AFHQCatValidation
      params:
        size: 128
lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False
  trainer:
    benchmark: True
```
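One detail worth noting about `attention_resolutions` in `openaimodel.UNetModel`: the values are downsampling factors relative to the latent, and attention blocks are inserted wherever the running downsample factor matches one of them. For the 32×32 latents here, `[8, 4, 2]` means attention on 4×4, 8×8, and 16×16 feature maps. A small helper (name mine) to make that concrete:

```python
def attention_sizes(latent_size, attention_resolutions):
    """Map the config's downsampling factors to the spatial sizes
    at which the UNet applies attention: latent_size // ds."""
    return sorted(latent_size // ds for ds in attention_resolutions)

# latent_size=32 with factors [8, 4, 2] -> attention at 4x4, 8x8, 16x16
```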
Please reference the `log_images` function: https://github.com/CompVis/latent-diffusion/blob/main/ldm/models/diffusion/ddpm.py#L1251
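For what it's worth, a reading of that `log_images` implementation suggests: `diffusion_row` stacks forward-noised latents z_t at every `log_every_t` steps (so it visualizes the forward process), `progressive_row` stacks the intermediate denoised predictions produced while sampling (the reverse process), and `mask` is only meaningful for inpainting-style conditioning, so it can be ignored for an unconditional LDM. A toy NumPy sketch of how a `diffusion_row` could be assembled (illustrative only, not the repo's code):

```python
import numpy as np

def diffusion_row(z0, betas, log_every_t=100, rng=None):
    """Toy forward process: q(z_t | z_0) has mean sqrt(abar_t) * z_0
    and std sqrt(1 - abar_t). Returns noised latents sampled at
    t = 0, log_every_t, 2*log_every_t, ... for visualization."""
    rng = rng or np.random.default_rng(0)
    alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t
    row = []
    for t in range(0, len(betas), log_every_t):
        eps = rng.standard_normal(z0.shape)
        abar = alphas_cumprod[t]
        row.append(np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps)
    return np.stack(row)

# With timesteps=1000 and log_every_t=100 this yields 10 frames,
# progressively noisier from left to right.
```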
@GrandpaXun242 can you help me with this? #287
@pseudo-usama Congratulations on your training! But there is still some distortion in the sample results? Note that you used batch_size=10 for training; how many epochs or steps did it take to produce the inference results shown? Also, regarding the hyperparameters you mentioned in the LDM training config, you could browse autoencoder.py and openaimodel.py in detail.
It took about 75 epochs and about 12 hours of training time for the above results, and used about 15 GB of GPU memory.
@CharmsGraker Yes, there are some poor results in sampling, but I guess that can be solved with further training. Another thing is that, for some reason, I had to reduce the input/output size and latent code size, so that could also be responsible for some bad samples.
Were both the VAE and the Latent Diffusion model trained for 75 epochs?