
Running On Multiple GPUs

playmakerbugger opened this issue 3 years ago · 5 comments

Hi, I am running the image harmonization part of the model with `--train_stages 6`, `--max_size 350`, and `--lr_scale 0.5` to increase the quality of the images.

However, once I get to the second stage of training, it crashes due to lack of CUDA memory. I altered the torch device so the model can use more than one GPU (say, GPUs 0 and 1) and wrapped the model in `DataParallel` so that it can run in parallel on multiple GPUs. However, it still only runs on one GPU.

Do you have any suggestions to fix this issue?
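For reference, wrapping a model in `nn.DataParallel` typically looks like the sketch below. `Generator` here is a stand-in class for illustration, not ConSinGAN's actual generator:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Stand-in for a stage generator (hypothetical, for illustration)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

netG = Generator()
if torch.cuda.device_count() > 1:
    # Scatters each input batch across GPUs 0 and 1, gathers outputs on GPU 0.
    netG = nn.DataParallel(netG, device_ids=[0, 1])
if torch.cuda.is_available():
    netG = netG.cuda()

out = netG(torch.randn(2, 3, 64, 64))
```

One caveat worth noting: `DataParallel` splits work along the batch dimension, so with a batch size of 1 (as in single-image training) the whole batch lands on one GPU and the second GPU gets nothing, which would match the behavior described above.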

playmakerbugger avatar Jun 23 '21 13:06 playmakerbugger

Without seeing the code it's difficult to say. Have you changed how the `--gpu` parameter is handled (in the `main_train.py` file)? By default it's set to `0`, and later in the code we do (see here): `if torch.cuda.is_available(): torch.cuda.set_device(opt.gpu)`. You might have to change that to get it to work.
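As a sketch, the single-GPU selection described above could be extended to accept a comma-separated list. The `--gpu` flag name follows the thread; the list parsing is an assumption about how one might change it, not the repo's actual code:

```python
import argparse
import torch

parser = argparse.ArgumentParser()
# Hypothetical extension: accept "0" or "0,1" instead of a single int.
parser.add_argument("--gpu", default="0", help="GPU id(s), e.g. '0' or '0,1'")
opt = parser.parse_args(["--gpu", "0,1"])

gpu_ids = [int(g) for g in opt.gpu.split(",")]
if torch.cuda.is_available():
    # torch.cuda.set_device accepts a single device only:
    # pin the first id as the primary GPU; the rest go to DataParallel.
    torch.cuda.set_device(gpu_ids[0])
```

Note that `torch.cuda.set_device` takes exactly one device; additional ids would have to be passed to something like `DataParallel(model, device_ids=gpu_ids)` rather than to `set_device`.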

tohinz avatar Jun 23 '21 14:06 tohinz

Hi, I'm still having the problem. I changed that line to pass a torch device covering two GPUs to `set_device`, but it still runs on one GPU.

playmakerbugger avatar Jun 24 '21 12:06 playmakerbugger

Sorry for the late response. What kind of GPU are you running this on, and how much VRAM does it have? I run all of my experiments on a single GPU with ~12GB of VRAM without problems.

tohinz avatar Jun 30 '21 09:06 tohinz

GPU 0, with about 30,000 MiB of VRAM.
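For completeness, the per-device VRAM that `nvidia-smi` reports in MiB can also be queried from PyTorch (a small sketch):

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # total_memory is in bytes; convert to MiB to match nvidia-smi.
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**20:.0f} MiB")
else:
    print("No CUDA device available")
```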

playmakerbugger avatar Jul 07 '21 17:07 playmakerbugger

I have the same problem. Is there any solution?

Liz1317 avatar Jul 16 '22 05:07 Liz1317