FastPointTransformer
DDP/DP training - multigpu
Hi @chrockey, great work!
Can you guide me on how to set up multi-GPU training? I only have 20 GB GPUs available, and with a batch size of 2 I get poor performance (~6% lower mIoU and mAcc), probably due to batch norm with such a small batch size.
If I add multi-GPU support (DDP) following the example from the ME repository, training stalls, i.e. it never starts.
Any help would be appreciated. You commented "multi-GPU training is currently not supported" in the code. Have you had similar issues to the ones I mentioned?
Thanks!
Hi @helen1c,
> Have you had similar issues to the ones I mentioned?
No, I haven't. I was able to use DDP with PyTorch Lightning and ME together. However, I found a weird issue: the model's performance gets a little bit worse (~1%). That's why I do not use multi-GPU training in this repo. Anyway, here is a code snippet to support DDP training:
You need to convert the BN modules into synchronized BN before this line: https://github.com/POSTECH-CVLab/FastPointTransformer/blob/9d8793a1bb0bb5e2a6175f32dd116f88c8171d23/train.py#L39 as follows:
```python
if gpus > 1:
    # Convert MinkowskiEngine BN layers to their synchronized counterparts.
    model = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(model)
    # Convert any remaining plain PyTorch BN layers as well.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```
Then, set the DDP-related keyword arguments here: https://github.com/POSTECH-CVLab/FastPointTransformer/blob/9d8793a1bb0bb5e2a6175f32dd116f88c8171d23/train.py#L61 as follows:
```python
if gpus > 1:
    kwargs["replace_sampler_ddp"] = True   # let Lightning inject a DistributedSampler
    kwargs["sync_batchnorm"] = False       # BN is already converted manually above
    kwargs["strategy"] = "ddp_find_unused_parameters_false"
```
I hope this helps your experiments.
@chrockey Unfortunately, this doesn't help; I run into the same problem again.
Could you tell me which versions of PyTorch, CUDA, and PyTorch Lightning you are using?
Thanks for the quick reply though! :)
Sorry for the late reply. Here are the versions:
- CUDA: 11.3
- PyTorch: 1.12.1
- PyTorch Lightning: 1.8.2
- TorchMetrics: 0.11.0
FYI, I've just uploaded the environment.yaml file to the master branch, which you can refer to.
If you have further questions, please feel free to re-open this issue.
Hi @chrockey ,
I ran into the same problem: training worked well with a single GPU, but when I tried multi-GPU training as you suggested, the training process stalled at epoch 0 and no errors were reported. The GPU memory is occupied, but GPU utilization stays at 0.
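One generic thing that might help narrow this kind of hang down (just a general PyTorch/NCCL debugging sketch, nothing specific to this repository) is to enable verbose distributed logging before training starts:
```python
# Generic DDP-hang debugging sketch, not part of train.py: set these before
# any process group is created (e.g. at the very top of the training script).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL communicator setup and errors
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # report unused/mismatched parameters across ranks
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")     # turn stuck collectives into errors after the timeout
```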
Hi @lishuai-97 , I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!
Hi @Charlie839242, sorry for the late reply. Unfortunately, I never solved the problem in the end; I suspect it is an issue with the PyTorch Lightning setup. I have since moved to a new point cloud processing repository, https://github.com/Pointcept/Pointcept, which is also an amazing piece of work and includes many SOTA methods.