
The training process always gets killed

Open SwordBearFire opened this issue 3 years ago • 4 comments

Hi,

Thank you for your great work, it's amazing. However, when I use co-mod-gan to train on the FFHQ dataset myself, the process always gets killed. My machine has 4 1080 Ti GPUs with 12 GB of GPU memory each, and 32 GB of RAM.

When I train the model on the 512x512 FFHQ TFRecord dataset, the process gets killed. Could you tell me how much memory you use, and what I should do? Thank you so much.

Best regards

SwordBearFire avatar May 18 '21 08:05 SwordBearFire

Preferably 8 GPUs with 16 GB (maybe 12 GB is OK) memory on each. Otherwise you have to reduce the batch size / network capacity.

zsyzzsoft avatar May 19 '21 03:05 zsyzzsoft
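For readers hitting the same limit, here is a minimal sketch of the two knobs mentioned above (per-GPU batch size and network capacity), assuming a StyleGAN2-style training config, which co-mod-gan builds on. The kwarg names `minibatch_gpu_base`, `minibatch_size_base`, and `fmap_base` are taken from the upstream StyleGAN2 code and are assumptions here, not confirmed co-mod-gan option names.

```python
# Sketch: reducing per-GPU batch size and network capacity in a
# StyleGAN2-style training config. Kwarg names are assumptions based on
# upstream StyleGAN2, not confirmed co-mod-gan options.

class EasyDict(dict):
    """Tiny stand-in for dnnlib.EasyDict: attribute access to dict keys."""
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__

sched = EasyDict()
G_args = EasyDict()
D_args = EasyDict()

# Per-GPU minibatch: 4 images/GPU is a common default at 512x512;
# dropping to 1-2 reduces GPU memory pressure.
sched.minibatch_gpu_base = 2
# Total minibatch across all GPUs (num_gpus * per-GPU minibatch).
sched.minibatch_size_base = 2 * 4  # 4 GPUs, as in this issue

# Network capacity: fmap_base scales the channel counts of G and D.
# The StyleGAN2 default is 16 << 10 (16384); halving it roughly halves
# the feature-map memory of both networks, at some cost in quality.
G_args.fmap_base = 8 << 10
D_args.fmap_base = 8 << 10

print(sched, G_args, D_args)
```

In practice these overrides would be applied in the training launch script before the training loop is started; the trade-off is that a smaller total batch or smaller `fmap_base` generally lowers final image quality.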

Preferably 8 GPUs with 16 GB (maybe 12 GB is OK) memory on each. Otherwise you have to reduce the batch size / network capacity.

Thank you for your kind reply. I tried using a batch size of one per GPU, but the program still gets killed. Maybe the main problem is RAM OOM, not GPU OOM.

MingtaoGuo avatar May 19 '21 03:05 MingtaoGuo
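For what it's worth, a plain "Killed" message with no Python traceback usually means the Linux OOM killer terminated the process because host RAM ran out, which matches the RAM-OOM guess above. Below is a minimal diagnostic sketch (not part of co-mod-gan) that watches host memory while training runs, using the `psutil` library.

```python
# Sketch: log host RAM usage alongside training to confirm whether the
# process is being killed by the Linux OOM killer (host RAM exhaustion)
# rather than running out of GPU memory.

import os
import time
import threading

import psutil


def log_memory(interval_s: float = 10.0) -> None:
    proc = psutil.Process(os.getpid())
    while True:
        rss_gb = proc.memory_info().rss / 2**30            # resident memory of this process
        avail_gb = psutil.virtual_memory().available / 2**30  # remaining host RAM
        print(f"[mem] RSS={rss_gb:.1f} GiB, available RAM={avail_gb:.1f} GiB", flush=True)
        time.sleep(interval_s)


# Start as a daemon thread so it exits together with the training process.
threading.Thread(target=log_memory, daemon=True).start()
```

If the available RAM drops toward zero right before the kill, it is the OOM killer; on Linux, `dmesg` or `journalctl -k` usually records an "Out of memory: Killed process ..." entry confirming it.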

I have no idea what causes RAM OOM :(

zsyzzsoft avatar May 19 '21 17:05 zsyzzsoft

How are you guys even managing to train :(

tiwarikaran avatar Jun 11 '21 18:06 tiwarikaran