DDPM-Pytorch
What parameter changes would I need to make sure it runs on our dataset?
I am running this code on a set of images but getting this error: "CUDA out of memory. Tried to allocate 150.06 GiB (GPU 0; 15.89 GiB total capacity; 720.18 MiB already allocated; 14.31 GiB free; 736.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation." I have updated the batch size and also resized the images to 224x224, but it is still giving me this CUDA error.
Can you please tell me what I should do?
Thanks
Hello,
224x224 is still large for this model. Can you please try following the steps mentioned here and see if it works fine after that?
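As an aside, the allocator hint at the end of the error message refers to PyTorch's `PYTORCH_CUDA_ALLOC_CONF` setting. A minimal sketch of applying it is below; note it only mitigates fragmentation, so the real fix is still a smaller `im_size`/batch size:

```python
import os

# Allocator hint from the error message: cap block splits to reduce
# fragmentation. This must be set before CUDA is initialized (or
# exported in the shell before launching training).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch  # imported after setting the env var
```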
Hi, thank you for the reply. It is running now. But if I have to run at 224x224, how can I do that? BTW I am using im_size = 64.
With 224x224 images it would be difficult using the current code version, but you could try the following:
- Reduce the number of channels and layers significantly until single-GPU memory is enough (though chances are it would not give good results).
- Right now the code does not support multi-GPU training, but feel free to make changes to have it run on multiple GPUs.
- Use a VAE/VQVAE to get 224x224 -> 64x64 latents, then train the diffusion model on a single GPU on these 64x64 latents. During sampling, feed the generated 64x64 latents to the decoder of the VAE/VQVAE to get a 224x224 image (a sketch of this approach follows below). By the end of this month I will have a repo for Stable Diffusion that will allow you to do this.
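A minimal sketch of that third option, assuming a pretrained VAE/VQVAE. The encoder/decoder here are toy stand-ins, and the denoiser would be this repo's Unet trained with `im_size: 64` and `im_channels` set to the latent channel count:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_CHANNELS = 4  # assumption; use whatever your VAE/VQVAE produces


class ToyEncoder(nn.Module):
    """Stand-in for a trained VAE/VQVAE encoder: 224x224 image -> 64x64 latent."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, LATENT_CHANNELS, kernel_size=3, padding=1)

    def forward(self, x):
        # A real encoder downsamples with learned strided convs;
        # interpolation keeps this sketch short.
        x = F.interpolate(x, size=(64, 64), mode='bilinear', align_corners=False)
        return self.conv(x)


class ToyDecoder(nn.Module):
    """Stand-in for the matching decoder: 64x64 latent -> 224x224 image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(LATENT_CHANNELS, 3, kernel_size=3, padding=1)

    def forward(self, z):
        z = F.interpolate(z, size=(224, 224), mode='bilinear', align_corners=False)
        return self.conv(z)


encoder, decoder = ToyEncoder(), ToyDecoder()

# Training: encode images once, then run the existing DDPM training loop
# on the 64x64 latents instead of on pixels.
images = torch.randn(8, 3, 224, 224)  # stand-in batch
with torch.no_grad():
    latents = encoder(images)          # (8, 4, 64, 64)
# ...noise `latents` with the scheduler and train the Unet on them,
# exactly as the repo currently trains on images.

# Sampling: reverse-diffuse 64x64 latents, then decode to 224x224.
sampled_latents = torch.randn(8, LATENT_CHANNELS, 64, 64)  # would come from DDPM sampling
with torch.no_grad():
    generated = decoder(sampled_latents)
print(generated.shape)  # torch.Size([8, 3, 224, 224])
```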
Thank you for your response.
Hi,
I trained the model on a medical dataset and after sampling the results are not as expected. Am I missing something? Please throw some light on this.
When you say results are not as expected, do you mean the generated images are complete garbage, or are they just not of very high quality? Was the generation output improving throughout the training epochs? Also, is it possible to share the model config, a sample dataset image, and the generated output?
Hi,
I am attaching the config settings, output, and input image.
The model is improving during training.
A couple of things that I can think of. I see your images are grayscale; is there any specific reason to use 3 channels? Maybe try with im_channels : 1. Based on these images, I suspect the model needs to be trained more (I had used 40 epochs for MNIST itself), so maybe train for 100/200 epochs.
Can you see if this helps?
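For reference, a sketch of how those changes might look in the YAML config (key names assumed from the repo's default config style; verify them against your own file):

```yaml
model_params:
  im_channels: 1    # grayscale input instead of 3-channel
  im_size: 64
train_params:
  num_epochs: 200   # train longer than the MNIST-scale default of 40
```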
No, the images are not grayscale; they have 3 channels. But I will train for more epochs.
Hi there, how did you do this? My dataset also has 3 channels, and I made all the changes mentioned by @explainingai-code, but I got a size mismatch error.
Hi @xiaoxiao079 , It looks from the error that the code is trying to load a checkpoint that was trained on a different configuration than the one you are currently using to train/infer. If this error comes during training, there might already be a checkpoint with the same name but trained using a different configuration, which throws the error here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/train_ddpm.py#L49 If the error comes during sampling, the config you are using during sampling might be incorrect here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/sample_ddpm.py#L73
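To debug this, a minimal sketch that inspects a saved checkpoint's parameter shapes so you can confirm it matches the model your current config builds (the task/checkpoint names here are assumptions based on the repo's defaults; adjust them to your config):

```python
import os
import torch

# Assumed default checkpoint location; adjust task_name/ckpt_name to
# match your own config.
ckpt_path = os.path.join('default', 'ddpm_ckpt.pth')

if os.path.exists(ckpt_path):
    state_dict = torch.load(ckpt_path, map_location='cpu')
    # Print a few parameter shapes from the saved checkpoint. If these
    # do not match your current model (e.g. a first conv expecting 1
    # input channel while your config says im_channels: 3), then
    # load_state_dict will fail with a size mismatch error.
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))
    # If the checkpoint is from an old run with a different config,
    # delete (or rename) it so training starts fresh:
    # os.remove(ckpt_path)
else:
    print('No existing checkpoint found at', ckpt_path)
```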
