
DDP/DP training - multigpu

Hi @chrockey, great work!

Can you guide me on how to set up multi-GPU training? I only have 20GB GPUs available, and with a batch size of 2 I obtain poor performance (~6% lower mIoU and mAcc), probably due to batch normalization with such a small batch size.

If I add multi-GPU (DDP) support following the example from the ME repository, training is blocked, i.e., it never starts.

Any help would be appreciated. You commented "multi-GPU training is currently not supported" in the code. Have you had similar issues to the ones I mentioned?

Thanks!

helen1c avatar Jan 28 '23 17:01 helen1c

Hi @helen1c,

Have you had similar issues to the ones I mentioned?

No, I haven't. I was able to use DDP with PyTorch Lightning and ME together. However, I found a weird issue: the model's performance gets slightly worse (~1%). That's why I do not use multi-GPU training in this repo. Anyway, here is a code snippet to support DDP training:

You need to convert the BN modules into synchronized BN modules before this line: https://github.com/POSTECH-CVLab/FastPointTransformer/blob/9d8793a1bb0bb5e2a6175f32dd116f88c8171d23/train.py#L39 as follows:

if gpus > 1:
    # Convert MinkowskiEngine batch norms to their synchronized counterparts.
    model = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(model)
    # Convert any remaining plain torch.nn.BatchNorm layers as well.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Then, set the DDP-related keyword arguments here: https://github.com/POSTECH-CVLab/FastPointTransformer/blob/9d8793a1bb0bb5e2a6175f32dd116f88c8171d23/train.py#L61 as follows:

if gpus > 1:
    kwargs["replace_sampler_ddp"] = True   # let Lightning wrap the data sampler in a DistributedSampler
    kwargs["sync_batchnorm"] = False       # BN layers are converted manually above, so Lightning should not convert them again
    kwargs["strategy"] = "ddp_find_unused_parameters_false"

I hope this helps your experiments.

chrockey avatar Jan 29 '23 12:01 chrockey

@chrockey Unfortunately, this doesn't help. I hit the same problem again.

Can you provide the versions of PyTorch, CUDA, and PyTorch Lightning you are using?

Thanks for the quick reply though! :)

helen1c avatar Jan 29 '23 15:01 helen1c

Sorry for the late reply. Here are the versions:

  • CUDA: 11.3
  • PyTorch: 1.12.1
  • PyTorch Lightning: 1.8.2
  • TorchMetrics: 0.11.0

FYI, I've just uploaded the environment.yaml file to the master branch, which you can refer to.
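
If it helps, a quick sanity check that a local environment matches these versions could look like this (a minimal sketch; it only checks the packages listed above):

import torch
import pytorch_lightning
import torchmetrics

print(torch.__version__)              # expected: 1.12.1
print(torch.version.cuda)             # expected: 11.3
print(pytorch_lightning.__version__)  # expected: 1.8.2
print(torchmetrics.__version__)       # expected: 0.11.0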

chrockey avatar Feb 01 '23 09:02 chrockey

If you have further questions, please feel free to re-open this issue.

chrockey avatar Feb 07 '23 13:02 chrockey

Hi @chrockey ,

I ran into the same problem. Training worked well with a single GPU, but when I tried multi-GPU training as you suggested, the process stalled at epoch 0 with no errors reported. The GPU memory is occupied, but GPU utilization stays at 0.

lishuai-97 avatar Feb 07 '23 15:02 lishuai-97

Hi @lishuai-97, I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!

Charlie839242 avatar Feb 11 '24 16:02 Charlie839242

Hi @Charlie839242, sorry for the late reply. Unfortunately, I never solved the problem in the end; I suspect it is an issue with the PyTorch Lightning setup. I have since moved to a new point cloud processing repository, https://github.com/Pointcept/Pointcept, which is also an amazing work that includes many SOTA methods.

lishuai-97 avatar Mar 03 '24 04:03 lishuai-97