DG-Net

multi-gpu training

tau-yihouxiang opened this issue 5 years ago · 13 comments

I checked that torch.nn.DataParallel has been used, so I wonder why the multi-GPU model does not work. Thanks in advance.

tau-yihouxiang avatar Jul 11 '19 01:07 tau-yihouxiang

trainer is a high-level container; we need to wrap the leaf models inside it individually:

if num_gpu>1:
    #trainer.teacher_model = torch.nn.DataParallel(trainer.teacher_model, gpu_ids)
    trainer.id_a = torch.nn.DataParallel(trainer.id_a, gpu_ids)
    trainer.gen_a.enc_content = torch.nn.DataParallel(trainer.gen_a.enc_content, gpu_ids)
    trainer.gen_a.mlp_w1 = torch.nn.DataParallel(trainer.gen_a.mlp_w1, gpu_ids)
    trainer.gen_a.mlp_w2 = torch.nn.DataParallel(trainer.gen_a.mlp_w2, gpu_ids)
    trainer.gen_a.mlp_w3 = torch.nn.DataParallel(trainer.gen_a.mlp_w3, gpu_ids)
    trainer.gen_a.mlp_w4 = torch.nn.DataParallel(trainer.gen_a.mlp_w4, gpu_ids)
    trainer.gen_a.mlp_b1 = torch.nn.DataParallel(trainer.gen_a.mlp_b1, gpu_ids)
    trainer.gen_a.mlp_b2 = torch.nn.DataParallel(trainer.gen_a.mlp_b2, gpu_ids)
    trainer.gen_a.mlp_b3 = torch.nn.DataParallel(trainer.gen_a.mlp_b3, gpu_ids)
    trainer.gen_a.mlp_b4 = torch.nn.DataParallel(trainer.gen_a.mlp_b4, gpu_ids)
    # assign back into the list so the wrapped discriminators are actually used
    for i, dis_model in enumerate(trainer.dis_a.cnns):
        trainer.dis_a.cnns[i] = torch.nn.DataParallel(dis_model, gpu_ids)

This code works on multiple GPUs; you may give it a try. Note that you also need to modify the model-saving code so that it saves model.module instead of the DataParallel wrapper.
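For reference, here is a minimal sketch of that saving change, assuming the checkpoint code calls state_dict() directly on the wrapped modules (the unwrap helper below is illustrative, not part of the repo):

import torch

def unwrap(model):
    # DataParallel keeps the real network in .module; plain modules pass through unchanged.
    return model.module if isinstance(model, torch.nn.DataParallel) else model

# e.g. when building the checkpoint dictionary:
state = {'id_a': unwrap(trainer.id_a).state_dict(),
         'enc_content': unwrap(trainer.gen_a.enc_content).state_dict()}
torch.save(state, 'checkpoint.pt')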

However, this is not the best solution, and we are still working on it. You might notice that I did not include the decoder:

    trainer.gen_a.dec = torch.nn.DataParallel(trainer.gen_a.dec, gpu_ids)

This is because of the adaptive instance normalisation layer, which cannot simply be duplicated across multiple GPUs.

layumi avatar Jul 11 '19 04:07 layumi

Thank you! This is really helpful.

tau-yihouxiang avatar Jul 12 '19 01:07 tau-yihouxiang

trainer is a high-level container; we need to wrap the leaf models inside it individually. [...] This is because of the adaptive instance normalisation layer, which cannot simply be duplicated across multiple GPUs.

I notice that F.batch_norm() is used in the AdaptiveInstanceNorm2d class; is that the reason?

Phi-C avatar Jul 17 '19 08:07 Phi-C

Hi @ChenXingjian

Not really. It is due to the values of w and b in the adaptive instance normalisation layer. https://github.com/NVlabs/DG-Net/blob/master/networks.py#L822-L823

We compute w and b on the fly and use assign_adain_params to write the current values into the layer. https://github.com/NVlabs/DG-Net/blob/master/networks.py#L236

PyTorch's DataParallel splits the batch into several parts and replicates the network onto all GPUs, so the per-GPU input no longer matches the size of w and b.

For example, suppose we use a mini-batch of 8 samples and have two GPUs. The input to each GPU is 4 samples, but w and b are sized for 8, since they are duplicated from the original full model.
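Here is a toy illustration of that mismatch (a simplified stand-in, not DG-Net's actual AdaptiveInstanceNorm2d): the externally assigned weight and bias keep the full-batch size, while each DataParallel replica only receives a slice of the input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAdaIN(nn.Module):
    # Minimal stand-in for an AdaIN layer whose affine parameters are
    # assigned from outside the forward pass (as assign_adain_params does).
    def __init__(self, num_features):
        super().__init__()
        self.num_features = num_features
        self.weight = None  # expected shape: (batch * num_features,)
        self.bias = None

    def forward(self, x):
        b, c = x.size(0), x.size(1)
        # Folding the batch into the channel dimension makes batch norm act as
        # per-sample (instance) normalisation with per-sample affine parameters.
        x_reshaped = x.contiguous().view(1, b * c, *x.shape[2:])
        out = F.batch_norm(x_reshaped, None, None, self.weight, self.bias,
                           True, 0.1, 1e-5)
        return out.view_as(x)

adain = ToyAdaIN(16)
adain.weight = torch.randn(8 * 16)      # assigned for the FULL mini-batch of 8
adain.bias = torch.randn(8 * 16)
out = adain(torch.randn(8, 16, 4, 4))   # fine on a single device

# Under DataParallel each replica sees only 4 samples, so it would need
# 4 * 16 = 64 affine values, but weight and bias are copied with
# 8 * 16 = 128 entries, which no longer matches inside F.batch_norm.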

layumi avatar Jul 18 '19 01:07 layumi

@layumi Thank you, that is really helpful. Is there any reference for how to modify the code?

Phi-C avatar Jul 18 '19 02:07 Phi-C

Hi @ChenXingjian, I am working on it and checking the results. If everything goes well, I will upload the code next week.

layumi avatar Jul 29 '19 09:07 layumi

It seems to work with multiple GPUs when you move the "assign_adain_params" function into the "Decoder" class.

FreemanG avatar Jul 29 '19 10:07 FreemanG

@FreemanG Yes, you are right. We could wrap the encoder+decoder together as one function at the beginning, so there will not be any problem with mismatched dimensions.

In fact, I have written the code, and I am checking the result before I release it.
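Here is a sketch of that idea (the method names on the generator, such as enc_content and decode, are assumptions for illustration, not necessarily the exact DG-Net API): by putting the whole encode-assign-decode path inside one forward(), DataParallel replicates the unit as a whole, and every replica derives w and b from its own sub-batch.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    # Hypothetical wrapper: encode the content, then decode with AdaIN,
    # so assign_adain_params runs inside each replica on its per-GPU slice.
    def __init__(self, gen):
        super().__init__()
        self.gen = gen  # e.g. trainer.gen_a

    def forward(self, images, id_code):
        content = self.gen.enc_content(images)
        # decode() is assumed to call assign_adain_params internally,
        # now with w and b sized for this replica's sub-batch.
        return self.gen.decode(content, id_code)

# if num_gpu > 1:
#     encdec_a = torch.nn.DataParallel(EncoderDecoder(trainer.gen_a), gpu_ids)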

layumi avatar Jul 29 '19 10:07 layumi

Dear all,

I have just added support for multi-GPU training. You are welcome to check it out.

  • About memory usage:

You still need two GPUs with 10 GB+ memory for now. I have not written the fp16 support for multiple GPUs. (I will consider supporting it in the near future.)

Some losses are still calculated on the first GPU, so the memory usage of the first GPU is larger than that of the second.

The main reason is that copy.deepcopy does not currently support the multi-GPU setup, so I still keep some losses and forward functions running on the first GPU.

  • About speed:

I tested it on my two P6000 GPUs (their speed is close to a GTX 1080).

A single GPU takes about 1.1 s per iteration at the beginning.

Two GPUs take about 0.9 s per iteration at the beginning.

(Since the teacher-model calculation is added at the 30000th iteration, training slows down after that point.)

layumi avatar Aug 01 '19 01:08 layumi

Great :+1:

FreemanG avatar Aug 01 '19 02:08 FreemanG

Is it possible to use nn.parallel.replicate instead of deepcopy?
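For context, nn.parallel.replicate is the primitive DataParallel itself uses to create its per-GPU copies; a minimal call looks like the sketch below. Whether it can replace copy.deepcopy for the teacher model depends on how the copy is used afterwards, since the replicas stay tied to specific devices.

import torch
from torch.nn.parallel import replicate

net = torch.nn.Linear(10, 10).cuda(0)   # module must live on the first listed device
replicas = replicate(net, [0, 1])       # one copy per listed GPU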

ramondoo avatar Aug 08 '19 11:08 ramondoo

@FreemanG Yes, you are right. We could wrap the encoder+decoder together as one function at the beginning, so there will not be any problem with mismatched dimensions.

In fact, I have written the code, and I am checking the result before I release it.

Hi, thank you very much for implementing the multi-GPU training version. May I ask where the method you mentioned (wrapping the encoder+decoder together as one function at the beginning) is reflected in the code? I did not find it in your latest version. Thank you very much.

Xt-Chen avatar Aug 28 '20 15:08 Xt-Chen

Hi @layumi, if you are still here, could you please elaborate one more time on how you made the adaptive instance normalization layer work in multi-GPU mode with nn.DataParallel? I looked through the code and version history, but I did not see any substantial changes compared to the first commit.

Thank you

qasymjomart avatar May 30 '21 08:05 qasymjomart