[Reproduction issue] Semantic image synthesis and layout-to-image cannot be reproduced
Can you provide inference scripts for semantic image synthesis and layout-to-image synthesis? I tried to use data loaders from the taming-transformers repo but got random noise outputs. The evaluation results are far from those reported in the paper. Thanks!
same question
Thank you very much for publishing your excellent research results. I am also interested in reproducing the layout-to-image model. Is there any reproduction code available? Thank you in advance for your consideration.
Also waiting for the release of the pretrained layout-to-image model trained from scratch on COCO and the dataset code. Thanks!!
Also waiting for the semantic synthesis training pipeline
Hi,
I managed to train the semantic image synthesis model. I first collected the Flickr data according to the readme in the taming-transformers repo and used sflckr.py as the training dataset.
Then I wrote my yaml config file, based on the released config:
model:
  base_learning_rate: 1.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0205
    log_every_t: 100
    timesteps: 1000
    loss_type: l1
    first_stage_key: image
    cond_stage_key: segmentation
    image_size: 64
    channels: 3
    concat_mode: true
    cond_stage_trainable: true

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ]
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 6
        out_channels: 3
        model_channels: 128
        attention_resolutions:
        - 32
        - 16
        - 8
        num_res_blocks: 2
        channel_mult:
        - 1
        - 4
        - 8
        num_heads: 8

    first_stage_config:
      target: ldm.models.autoencoder.VQModelInterface
      params:
        embed_dim: 3
        n_embed: 8192
        ckpt_path: models/first_stage_models/vq-f4/model.ckpt
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.SpatialRescaler
      params:
        n_stages: 2
        in_channels: 182
        out_channels: 3

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    num_workers: 5
    wrap: False
    train:
      target: ldm.data.flickr.FlickrSegTrain # PUT YOUR DATASET
      params:
        size: 256
    validation:
      target: ldm.data.flickr.FlickrSegEval # PUT YOUR DATASET
      params:
        size: 256

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False
  trainer:
    benchmark: True
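Before launching training, a quick sanity check that a config like the one above actually instantiates can save time. This is just a sketch; the config path is an example, and it assumes the vq-f4 checkpoint is already at the path given under first_stage_config:

from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

config = OmegaConf.load('configs/flickr_semantic_synthesis.yaml')  # example path for the yaml above
model = instantiate_from_config(config.model)  # builds LatentDiffusion, including first and cond stage
print(type(model).__name__)  # LatentDiffusion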
Finally, I ran python main.py --base <config_above>.yaml -t --gpus 0, to train the model.
It worked. Here is a result from my training process:
[images: conditions / samples]
By the way, I noticed that the released config yaml file doesn't load a checkpoint in the first stage config:
first_stage_config:
  target: ldm.models.autoencoder.VQModelInterface
  params:
    embed_dim: 3
    n_embed: 8192
    ckpt_path: models/first_stage_models/vq-f4/model.ckpt # this line is missing
    ddconfig:
      double_z: false
I wonder whether this is the reason inference fails.
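One way to check whether the first-stage weights were actually restored is to compare a weight tensor of the instantiated model against the standalone vq-f4 checkpoint. A sketch, assuming model is a LatentDiffusion instantiated from the config above and that the released checkpoint is a Lightning file with a state_dict entry:

import torch

sd = torch.load('models/first_stage_models/vq-f4/model.ckpt', map_location='cpu')['state_dict']
w_ckpt = sd['encoder.conv_in.weight']
w_model = model.first_stage_model.encoder.conv_in.weight.detach().cpu()
print(torch.allclose(w_ckpt, w_model))  # False means the first stage is still randomly initialized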
@otamic I saw your fantastic results. I am struggling with how to run inference (testing) with the pretrained model to generate landscape images from segmentation maps. Could you share your inference (test) code, if you can?
@YorkNishi999
This is my inference code, which mostly comes from the log_images method in ddpm.py:
import torch
import numpy as np
from scripts.sample_diffusion import load_model
from omegaconf import OmegaConf
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import save_image
from einops import rearrange

from ldm.data.flickr import FlickrSegEval


def ldm_cond_sample(config_path, ckpt_path, dataset, batch_size):
    config = OmegaConf.load(config_path)
    model, _ = load_model(config, ckpt_path, None, None)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    x = next(iter(dataloader))
    seg = x['segmentation']

    with torch.no_grad():
        # (B, H, W, C) -> (B, C, H, W), then map the one-hot segmentation to RGB for visualization
        seg = rearrange(seg, 'b h w c -> b c h w')
        condition = model.to_rgb(seg)

        # encode the segmentation with the cond stage model and sample with DDIM
        seg = seg.to('cuda').float()
        seg = model.get_learned_conditioning(seg)
        samples, _ = model.sample_log(cond=seg, batch_size=batch_size, ddim=True,
                                      ddim_steps=200, eta=1.)
        samples = model.decode_first_stage(samples)

    save_image(condition, 'cond.png')
    save_image(samples, 'sample.png')


if __name__ == '__main__':
    config_path = 'models\ldm\semantic_synthesis256\config.yaml'
    ckpt_path = 'models\ldm\semantic_synthesis256\model.ckpt'

    dataset = FlickrSegEval(size=256)
    ldm_cond_sample(config_path, ckpt_path, dataset, 4)
Note that one line is missing in the config file, as I described above.
I simply picked some segmentations from the dataset to generate images; you may want to make changes to suit your needs.
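If you want to feed your own segmentation map instead of picking one from the dataset class, something like the following should produce the expected input. This is a sketch based on my understanding of the sflckr.py format (a PNG of integer class ids, one-hot encoded into 182 channels); the file name is just an example:

import numpy as np
import torch
from PIL import Image

seg_ids = np.array(Image.open('my_segmentation.png'))  # (H, W) integer class ids in [0, 181]
onehot = np.eye(182)[seg_ids].astype(np.float32)       # (H, W, 182) one-hot map
seg_batch = torch.from_numpy(onehot)[None]             # (1, H, W, 182), same layout as x['segmentation']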
@otamic I am very grateful that you shared your code!
I used your code and generated images, but they are low quality. Just to make sure: you first train the model starting from ckpt_path = 'models\ldm\semantic_synthesis256\model.ckpt', and then you run inference (generation) from semantic images. Am I correct?
My generated image is here:

@YorkNishi999
In fact, models\ldm\semantic_synthesis256\model.ckpt refers to the pretrained model downloaded from the Pretrained LDMs section when I wrote this code.
To test your own trained model, just change the path to something like logs/xxxx/checkpoints/last.ckpt after a training process. (So you are right.)
This is a result tested on the downloaded model:
[images: condition / sample]
And my trained model:
[images: condition / sample]
It works fine here. So I wonder whether you just haven't trained your model long enough.
Works perfectly on my side, thanks @otamic !
@otamic Thank you for sharing your experiments!! I will retry it with some training..
@otamic I got good results after fixing my bugs (it was my fault).
Thank you again for your kindness!

@otamic Wow, that's nice. Can you share your dataloader code? I want to be sure about something. I will write my own :D
@SerdarHelli
I think you mean the dataset class in the config file:
data:
  ...
  params:
    ...
    train:
      target: ldm.data.flickr.FlickrSegTrain # PUT YOUR DATASET
      ...
    validation:
      target: ldm.data.flickr.FlickrSegEval # PUT YOUR DATASET
      ...
If so, I used the code from sflckr.py as described above. There is an Examples class in that script:
class Examples(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv="data/sflckr_examples.txt",
                         data_root="data/sflckr_images",
                         segmentation_root="data/sflckr_segmentations",
                         size=size, random_crop=random_crop, interpolation=interpolation)
And I added my own dataset classes, pointing to my own data (collected according to this), like this:
class FlickrSegTrain(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_train.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)


class FlickrSegEval(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_eval.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)
That's all I did. (These are only very small changes, which is why I didn't post them earlier.)
At this point, I believe I have written down everything needed to reproduce the semantic synthesis result.
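For what it's worth, the two txt files are just lists of relative image file names, one per line. A minimal sketch of how they could be generated, assuming .jpg images and same-named .png segmentations in the directories above (this layout is my reading of SegmentationBase in sflckr.py):

import os
import random

files = sorted(f for f in os.listdir('data/flickr/flickr_images') if f.endswith('.jpg'))
random.seed(0)
random.shuffle(files)
n_eval = max(1, len(files) // 20)  # hold out roughly 5% for validation
with open('data/flickr/flickr_eval.txt', 'w') as fh:
    fh.write('\n'.join(files[:n_eval]))
with open('data/flickr/flickr_train.txt', 'w') as fh:
    fh.write('\n'.join(files[n_eval:]))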
Yes, thanks @otamic. https://github.com/CompVis/taming-transformers/blob/master/taming/data/sflckr.py is actually what I was searching for. I knew they wrote it, but I didn't check it out :D
@otamic I have trained semantic synthesis 256 on Cityscapes with the same config you shared, but I am getting this image as a result. Do you have any idea why this can happen?
@mmash98 I think you should check your config. For example, is your cond stage input channel count still 182? How many labels does your dataset have?
@SerdarHelli I have changed that as well; in my case it is 35.
I see. Did you check your batches? I don't know, maybe you didn't train enough.
@mmash98
Could you try a smaller batch size, such as 4? If that doesn't help, I have no other ideas.
I think he didn't train enough. At 5k steps, I am getting the same results.
Guys, in addition, should we train our own VQGAN? I think we should train the VQGAN on our own data if our domain is very different.
Edit: I am getting worse results with LDM + VQ-f4 than with a GAN for semantic image synthesis. Probably I should train more, or my data is too limited for an LDM. Maybe LDM is not good on limited data.
Also, you can train on Colab. I can share the code.
@otamic Hey, may I ask a question? I followed your yaml and inference code to train on DeepFashion, whose semantic maps have 24 categories, and I changed 182 to 24. But my results are strange, as shown below. Is there anything else to pay attention to, or did I do something wrong? Looking forward to your reply, thanks so much!

@Kai-0515
I think you didn't successfully load the pretrained first stage model. Check that the missing line I mentioned has been added, and make sure the ckpt file is actually there. I had similar results myself, which is how I found the missing line.
@otamic You're right! Thanks very much for your quick reply!

Unlike the GAN methods, the condition is converted to an RGB-like image in this LDM setup. So your categories must be correct, otherwise you will feed the wrong condition. Also, you must be sure about the autoencoder (VQ, KL).
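To make the channel bookkeeping concrete, here is a small shape check of the conditioning path as I read the config above (the random tensor is just a stand-in for a one-hot segmentation map):

import torch
from ldm.modules.encoders.modules import SpatialRescaler

# n_stages=2 halves the resolution twice (256 -> 64); the 1x1 conv maps 182 -> 3 channels
rescaler = SpatialRescaler(n_stages=2, in_channels=182, out_channels=3)
seg = torch.randn(1, 182, 256, 256)
print(rescaler(seg).shape)  # torch.Size([1, 3, 64, 64]), matching image_size: 64
# this 3-channel map is concatenated with the 3-channel VQ-f4 latent, hence unet in_channels: 6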
Has anybody trained a model for the layout2img task yet? I'm not quite sure what my bounding box input is supposed to look like, and what a proper configuration would be. Thank you so much for any input. I know the layout2img-openimages256 config exists, but I'm not sure what the input is supposed to look like.
@otamic do I understand correctly that you train the whole model from scratch, except for the vq-f4? Is it also possible to skip training the UNet and VAE and only train the conditioning part?
@mauerflitzer
You are correct about my training. In my opinion, training only the conditioning part is not possible with this LDM setup. First, how would you supervise that training? Second, the UNet structures of the conditional and unconditional models are different: in this concat case, the number of channels at the UNet input is doubled when conditioning. What you describe sounds more like classifier-guided diffusion, which is a different way of conditioning.
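To illustrate the channel doubling with dummy tensors (a paraphrase of what concat_mode does, not the repo's exact code):

import torch

x_noisy = torch.randn(1, 3, 64, 64)  # noisy VQ-f4 latent
cond = torch.randn(1, 3, 64, 64)     # rescaled segmentation condition
unet_input = torch.cat([x_noisy, cond], dim=1)
print(unet_input.shape)  # torch.Size([1, 6, 64, 64]) -> why the conditional UNet has in_channels: 6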
@otamic I thought about freezing the UNet and VAE weights, taking a released checkpoint of 1.4 or maybe 1.5, then swapping out the conditioning part for the new one and starting training on that.
@mauerflitzer
Sorry, I don't understand what you mean by the checkpoint of 1.4 or 1.5. If the conditioning parts (τ_θ) work in the same way, I think you can just try it, although I intuitively think it might not work.