segmentation-guided-diffusion

The generated images contain noise points

Open · helinrui opened this issue 1 year ago · 4 comments

I trained an unconditional model on my own dataset (approximately 1000 images), then followed your command to generate images, but the output is just a bunch of noisy images. Why might that be?

```bash
CUDA_VISIBLE_DEVICES={DEVICES} python3 main.py \
    --mode eval_many \
    --model_type DDIM \
    --img_size 256 \
    --num_img_channels 3 \
    --dataset {DATASET_NAME} \
    --eval_batch_size 8 \
    --eval_sample_size 100
```

helinrui avatar Apr 24 '24 08:04 helinrui

Hi, sorry to hear that this issue is happening. How many epochs did you train for, and what do the images sampled during training look like in the samples folder of your model directory?

EDIT: I found a bug with training models for RGB (not greyscale) images. Fixing now, I'll be in touch!

nickk124 avatar Apr 24 '24 13:04 nickk124

Hi, I found a bug where the number of network output channels was not set properly; fixed in commit 28592394398dd2590c72b96aa0bad9b8b1d36f91. I tested the fix by training an unconditional model from scratch on CIFAR-10 and it seems to be working. Any luck now?
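
For anyone hitting the same symptom in their own pipeline, the failure mode is worth spelling out: if the UNet's output channel count doesn't match the number of image channels, the predicted noise has the wrong shape/meaning and sampling degenerates into pure noise even though training runs without errors. Below is a minimal sketch of the correct setup, assuming a diffusers-style UNet2DModel; the names are illustrative, not the exact code from this repo:

```python
from diffusers import UNet2DModel

num_img_channels = 3  # 3 for RGB, 1 for greyscale

# For standard noise prediction, out_channels must equal in_channels.
# A mismatch (e.g., out_channels left at a greyscale default of 1)
# yields garbage samples despite an apparently normal training run.
model = UNet2DModel(
    sample_size=256,
    in_channels=num_img_channels,
    out_channels=num_img_channels,
)
```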

nickk124 avatar Apr 24 '24 14:04 nickk124

Thank you for your answer. I retrained with your updated code for a total of 400 epochs, and the results are good now. But I have run into another issue with multi-GPU training:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py \
    --mode train \
    --model_type DDIM \
    --img_size 256 \
    --num_img_channels 3 \
    --dataset {DATASET_NAME} \
    --img_dir {DATA_FOLDER} \
    --train_batch_size 16 \
    --eval_batch_size 8 \
    --num_epochs 400
```

When epoch 0 finishes, it fails with the following error:

```
Traceback (most recent call last):
  File "/home/helinrui/slns/segmentation-guided-diffusion-main/main.py", line 412, in <module>
    main(
  File "/home/helinrui/slns/segmentation-guided-diffusion-main/main.py", line 335, in main
    train_loop(
  File "/home/helinrui/slns/segmentation-guided-diffusion-main/training.py", line 114, in train_loop
    noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/helinrui/anaconda3/envs/mig/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'sample' and 'timestep'
```

If I use only a single GPU, training works fine.

Problem 2: when I run the evaluation command with --eval_sample_size 100, it does not generate 100 samples; it only generates as many samples as --eval_batch_size, i.e., 8.

helinrui avatar Apr 29 '24 16:04 helinrui

Hi,

My apologies for the late reply; this fell through the cracks!

For your first problem: I haven't seen this myself and have had no trouble with multi-GPU training. Just to confirm, does your system/CUDA actually detect the four GPUs 0,1,2,3 that you specified in your run command? (See the quick check below.)
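
A quick way to verify this with plain PyTorch (nothing repo-specific):

```python
import torch

# Should print True and 4 if all four GPUs are visible to PyTorch
print(torch.cuda.is_available(), torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```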

For your second problem, are you using --mode eval or --mode eval_many in your run command? You want eval_many.

nickk124 avatar Jul 24 '24 15:07 nickk124

@helinrui @nickk124 Hi, this happens because the number of images in your training dataset folder is not divisible by the batch size, so the last batch of each epoch is smaller than the rest. With DataParallel, that undersized batch can be smaller than the number of GPUs, leaving some replicas with no input, which is what raises the "forward() missing 2 required positional arguments" error. The solution is to add drop_last=True to the torch.utils.data.DataLoader() call that builds train_dataloader in main.py, as sketched below.
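
A minimal sketch of the change (the actual arguments to the DataLoader in main.py may differ; train_dataset and the batch size here are placeholders):

```python
import torch

train_dataloader = torch.utils.data.DataLoader(
    train_dataset,    # your Dataset of training images
    batch_size=16,    # matches --train_batch_size
    shuffle=True,
    drop_last=True,   # drop the final undersized batch so every batch
                      # splits evenly across the DataParallel GPUs
)
```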

DBook111 avatar Dec 17 '24 08:12 DBook111