DDPM-Pytorch
What parameter changes would I need to make sure it runs on our dataset?
I am running this code on a set of images but getting this error: "CUDA out of memory. Tried to allocate 150.06 GiB (GPU 0; 15.89 GiB total capacity; 720.18 MiB already allocated; 14.31 GiB free; 736.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation." I have updated the batch size and also resized the images to 224x224, but it is still giving me this CUDA error.
Can you please tell me what I should do?
Thanks
Hello,
224x224 is still large for this model. Can you please try following the steps mentioned here and see if it works fine after that?
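As an aside, the allocator hint at the end of the error message refers to PyTorch's `PYTORCH_CUDA_ALLOC_CONF` setting. A minimal sketch of applying it is below; note it only mitigates fragmentation, so the real fix is still a smaller `im_size`/batch size:

```python
import os

# Allocator hint from the error message: cap block splits to reduce
# fragmentation. This must be set before CUDA is initialized (or
# exported in the shell before launching training).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch  # imported after setting the env var
```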
Hi, thank you for the reply. It is running now. But if I have to run at 224x224, how can I do that? BTW I am using im_size = 64.
With 224x224 images it would be difficult using the current code version, but you could try the following:
- Reduce the number of channels and layers significantly until single-GPU memory is enough (though chances are it would not give good results).
- Right now the code does not support multi-GPU training, but feel free to make changes to have it run on multiple GPUs.
- Use a VAE/VQVAE to get 224x224 -> 64x64 latents, then train the diffusion model on a single GPU on these 64x64 latents. During sampling, feed the generated 64x64 latents to the decoder of the VAE/VQVAE to get a 224x224 image (a sketch of this approach follows below). By the end of this month I will have a repo for Stable Diffusion that will allow you to do this.
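A minimal sketch of that third option, assuming a pretrained VAE/VQVAE. The encoder/decoder here are toy stand-ins, and the denoiser would be this repo's Unet trained with `im_size: 64` and `im_channels` set to the latent channel count:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_CHANNELS = 4  # assumption; use whatever your VAE/VQVAE produces


class ToyEncoder(nn.Module):
    """Stand-in for a trained VAE/VQVAE encoder: 224x224 image -> 64x64 latent."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, LATENT_CHANNELS, kernel_size=3, padding=1)

    def forward(self, x):
        # A real encoder downsamples with learned strided convs;
        # interpolation keeps this sketch short.
        x = F.interpolate(x, size=(64, 64), mode='bilinear', align_corners=False)
        return self.conv(x)


class ToyDecoder(nn.Module):
    """Stand-in for the matching decoder: 64x64 latent -> 224x224 image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(LATENT_CHANNELS, 3, kernel_size=3, padding=1)

    def forward(self, z):
        z = F.interpolate(z, size=(224, 224), mode='bilinear', align_corners=False)
        return self.conv(z)


encoder, decoder = ToyEncoder(), ToyDecoder()

# Training: encode images once, then run the existing DDPM training loop
# on the 64x64 latents instead of on pixels.
images = torch.randn(8, 3, 224, 224)  # stand-in batch
with torch.no_grad():
    latents = encoder(images)          # (8, 4, 64, 64)
# ...noise `latents` with the scheduler and train the Unet on them,
# exactly as the repo currently trains on images.

# Sampling: reverse-diffuse 64x64 latents, then decode to 224x224.
sampled_latents = torch.randn(8, LATENT_CHANNELS, 64, 64)  # would come from DDPM sampling
with torch.no_grad():
    generated = decoder(sampled_latents)
print(generated.shape)  # torch.Size([8, 3, 224, 224])
```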
Thank you for your response.
Hi,
I trained the model on a medical dataset and after sampling the results are not as expected. Am I missing something? Please throw some light on this.
When you say results are not as expected, do you mean the generated images are complete garbage, or are they just not of very high quality? Was the generation output improving throughout the training epochs? Also, is it possible to share the model config, a sample dataset image, and the generated output?
Hi,
I am attaching the config settings, output, and input image.
The model is improving during training.
A couple of things that I can think of. I see your images are grayscale; is there any specific reason to use 3 channels? Maybe try with im_channels : 1. Based on these images, I suspect the model needs to be trained more (I had used 40 epochs for MNIST itself), so maybe train for 100/200 epochs.
Can you see if this helps?
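For reference, a sketch of how those changes might look in the YAML config (key names assumed from the repo's default config style; verify them against your own file):

```yaml
model_params:
  im_channels: 1    # grayscale input instead of 3-channel
  im_size: 64
train_params:
  num_epochs: 200   # train longer than the MNIST-scale default of 40
```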
No, the images are not grayscale; they have 3 channels. But I will train for more epochs.
Hi there, how did you do this? My dataset also has 3 channels, and I made all the changes mentioned by @explainingai-code, but I got a size mismatch error.
Hi @xiaoxiao079 , It looks from the error that the code is trying to load a checkpoint that was trained on a different configuration than the one you are currently using to train/infer. If this error comes during training, there might already be a checkpoint with the same name but trained using a different configuration, which throws the error here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/train_ddpm.py#L49 If the error comes during sampling, the config you are using during sampling might be incorrect here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/sample_ddpm.py#L73
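To debug this, a minimal sketch that inspects a saved checkpoint's parameter shapes so you can confirm it matches the model your current config builds (the task/checkpoint names here are assumptions based on the repo's defaults; adjust them to your config):

```python
import os
import torch

# Assumed default checkpoint location; adjust task_name/ckpt_name to
# match your own config.
ckpt_path = os.path.join('default', 'ddpm_ckpt.pth')

if os.path.exists(ckpt_path):
    state_dict = torch.load(ckpt_path, map_location='cpu')
    # Print a few parameter shapes from the saved checkpoint. If these
    # do not match your current model (e.g. a first conv expecting 1
    # input channel while your config says im_channels: 3), then
    # load_state_dict will fail with a size mismatch error.
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))
    # If the checkpoint is from an old run with a different config,
    # delete (or rename) it so training starts fresh:
    # os.remove(ckpt_path)
else:
    print('No existing checkpoint found at', ckpt_path)
```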
