
What is progressive_row & diffusion_row in training?

Open pseudo-usama opened this issue 2 years ago • 6 comments

I've successfully trained latent diffusion on the AFHQ dataset, but I'm having a hard time interpreting the results. During training it produces these images in the log directory:

  • diffusion_row: Is this the forward process of the diffusion model? `diffusion_row_gs-045000_e-000087_b-000108`
  • mask: I'm training an unconditional LDM without any inpainting, so why is a mask being used here? `mask_gs-045000_e-000087_b-000108`
  • progressive_row: What is a progressive row? Is this the reverse diffusion process? `progressive_row_gs-045000_e-000087_b-000108`
  • inputs: `inputs_gs-045000_e-000087_b-000108`
  • reconstruction: `reconstruction_gs-045000_e-000087_b-000108`
  • samples: `samples_x0_quantized_gs-045000_e-000087_b-000108`

Additionally, the images are being saved with a naming convention of `inputs_gs-045000_e-000087_b-000108.png`. Could you please clarify the meaning of "gs" in this context?
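My guess, from skimming `ImageLogger.log_local` in `main.py` (not verified), is that the name is built from the Lightning trainer state, something like this:

```python
# Guess (not verified against the repo): gs = trainer.global_step,
# e = current_epoch, b = batch_idx, each zero-padded to 6 digits.
def log_name(key: str, global_step: int, epoch: int, batch_idx: int) -> str:
    return f"{key}_gs-{global_step:06}_e-{epoch:06}_b-{batch_idx:06}.png"

print(log_name("inputs", 45000, 87, 108))
# inputs_gs-045000_e-000087_b-000108.png
```

Is that right?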

My yaml files for training AFHQ dataset are:

Autoencoder YAML
model:
  base_learning_rate: 4.5e-06
  target: taming.models.vqgan.VQModel
  params:
    embed_dim: 3
    n_embed: 1024
    monitor: val/rec_loss

    ddconfig:
      double_z: false
      z_channels: 3
      resolution: 128
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [1,2,4]
      num_res_blocks: 2
      attn_resolutions: []
      dropout: 0.0
    lossconfig:
      target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
      params:
        disc_conditional: false
        disc_in_channels: 3
        disc_start: 0
        disc_weight: 0.75
        codebook_weight: 1.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 20
    num_workers: 16
    wrap: true
    train:
      target: ldm.data.afhq.AFHQCatTrain
      params:
        size: 128
        # crop_size: 128
    validation:
      target: ldm.data.afhq.AFHQCatValidation
      params:
        size: 128
        # crop_size: 128
Latent Diffusion YAML
model:
  base_learning_rate: 2.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0195
    num_timesteps_cond: 1
    log_every_t: 100
    timesteps: 1000
    first_stage_key: image
    image_size: 32
    channels: 3
    monitor: val/loss_simple_ema
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        in_channels: 3
        out_channels: 3
        model_channels: 224
        attention_resolutions:
        # note: these aren't actually resolutions but
        # downsampling factors, i.e. they correspond to
        # attention at spatial resolutions 16, 8, and 4, since the
        # spatial resolution of the latents is 32 here (f4 on 128px inputs)
        - 8
        - 4
        - 2
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 3
        - 4
        num_head_channels: 32
    first_stage_config:
      # target: taming.models.vqgan.VQModel
      target: ldm.models.autoencoder.VQModelInterface
      params:
        embed_dim: 3
        n_embed: 1024
        ckpt_path: models/first_stage_models/afhq-cat-vq/model.ckpt
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 128
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1,2,4 ]
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
          params:
            disc_conditional: False
            disc_in_channels: 3
            disc_start: 10000
            disc_weight: 0.5
            codebook_weight: 1.0
    cond_stage_config: __is_unconditional__
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 10
    num_workers: 5
    wrap: true
    train:
      target: ldm.data.afhq.AFHQCatTrain
      params:
        size: 128
    validation:
      target: ldm.data.afhq.AFHQCatValidation
      params:
        size: 128


lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False

  trainer:
    benchmark: True

pseudo-usama avatar Jun 18 '23 11:06 pseudo-usama

Please refer to the `log_images` function: https://github.com/CompVis/latent-diffusion/blob/main/ldm/models/diffusion/ddpm.py#L1251
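As I read that code: `diffusion_row` is the forward (noising) process, logged every `log_every_t` steps, and `progressive_row` shows intermediate predictions during sampling, i.e. the reverse process. The `mask` shows up even for unconditional models because, if I remember correctly, `log_images` has `inpaint=True` by default. A self-contained sketch of how the `diffusion_row` frames are produced (my reading of `ddpm.py`, not a verbatim copy; schedule values taken from your config):

```python
import numpy as np

# Sketch: noise the latent z_0 at t = 0, log_every_t, 2*log_every_t, ...
# In the real code each frame is decoded to pixel space and stitched into a row.
def diffusion_row(z0, timesteps=1000, log_every_t=100,
                  linear_start=0.0015, linear_end=0.0195, seed=0):
    rng = np.random.default_rng(seed)
    # LDM's "linear" schedule: linspace between sqrt endpoints, then square.
    betas = np.linspace(linear_start ** 0.5, linear_end ** 0.5, timesteps) ** 2
    alphas_cumprod = np.cumprod(1.0 - betas)
    frames = []
    for t in range(0, timesteps, log_every_t):
        noise = rng.standard_normal(z0.shape)
        z_t = (np.sqrt(alphas_cumprod[t]) * z0
               + np.sqrt(1.0 - alphas_cumprod[t]) * noise)
        frames.append(z_t)
    return frames

row = diffusion_row(np.zeros((3, 32, 32)))
print(len(row))  # 10 frames, one per logged timestep
```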

GrandpaXun242 avatar Jun 24 '23 03:06 GrandpaXun242

@GrandpaXun242 Can you help me with this? #287

pseudo-usama avatar Jun 26 '23 15:06 pseudo-usama

@pseudo-usama Congratulations on getting training to work! But there is still some distortion in the sample results, isn't there? Note that you used batch_size=10 for training; how many epochs or steps did it take to produce the inference results shown? As for the hyperparameters you mentioned in the LDM training config, you could look through autoencoder.py and openaimodel.py in detail, respectively.

CharmsGraker avatar Jul 11 '23 16:07 CharmsGraker

It took about 75 epochs and about 12 hours of training time to get the above results, using about 15 GB of GPU memory.

pseudo-usama avatar Jul 11 '23 16:07 pseudo-usama

@CharmsGraker Yes, there are some poor results among the samples, but I guess that can be solved with further training. Another thing is that, for some reason, I had to reduce the input/output size and the latent code size, so that could also be responsible for some of the bad samples.

pseudo-usama avatar Jul 11 '23 17:07 pseudo-usama

> It took about 75 epochs and about 12 hours of training time to get the above results, using about 15 GB of GPU memory.

Were both the VAE and the Latent Diffusion model trained for 75 epochs?

5huanghuai avatar Mar 08 '25 09:03 5huanghuai