
The correct way to enable multi-GPU training

Open sleeplessai opened this issue 3 years ago • 2 comments

Hi, @kwea123

I am conducting some experiments with this MVSNet implementation because of its clear and simple PyTorch Lightning wrapping. To speed up training, I train the model on 3 GPUs on my server, but when the hyperparameter --num_gpus was simply set to 3, PyTorch Lightning raised the following warning:

"You seem to have configured a sampler in your DataLoader. This will be replaced "
" by `DistributedSampler` since `replace_sampler_ddp` is True and you are using"
" distributed training. Either remove the sampler from your DataLoader or set"
" `replace_sampler_ddp=False` if you want to use your custom sampler."

To solve this problem, I modified train.py by setting the following parameters in the PL Trainer:

trainer = Trainer(# ......
                  gpus=hparams.num_gpus,
                  replace_sampler_ddp=False,
                  distributed_backend='ddp' if hparams.num_gpus > 1 else None,
                  # ......
                  )

The model trains correctly after these parameters are configured.

Is this the correct way to enable multi-GPU training? For some reason, I cannot install nvidia-apex on the current server. Should I use SyncBatchNorm for this model implementation, and if so, how? Does training without SyncBN hurt performance? If I should use it, please tell me which is preferable: nn.SyncBatchNorm.convert_sync_batchnorm() or PyTorch Lightning's sync_batchnorm flag in the Trainer configuration. A minimal sketch of the two options as I understand them is below.
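This is only a sketch, not tested code: it assumes `system` is the LightningModule built in train.py, and that the installed PL version supports the sync_batchnorm Trainer flag.

import torch.nn as nn
from pytorch_lightning import Trainer

# Option 1: convert every BatchNorm layer in the model to SyncBatchNorm
# before handing it to the Trainer (`system` is assumed to come from train.py)
system = nn.SyncBatchNorm.convert_sync_batchnorm(system)

# Option 2: let PyTorch Lightning do the same conversion via a Trainer flag
trainer = Trainer(gpus=hparams.num_gpus,
                  replace_sampler_ddp=False,
                  distributed_backend='ddp',
                  sync_batchnorm=True)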

Thanks a lot. 😊

sleeplessai avatar Nov 03 '21 17:11 sleeplessai

Hello, have you solved the problem, @sleeplessai?

Geo-Tell avatar Dec 03 '21 12:12 Geo-Tell

Hi, @Geo-Tell. Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages, PL to 0.9.0 and PyTorch to 1.6.0. As the author didn't give a quick reply, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it in the future. The code has been tested on a 3-GPU cluster node and works well.
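For anyone who finds this later, the working Trainer configuration looks roughly like this (a minimal sketch against the PL 0.9.0 / PyTorch 1.6.0 argument names; the rest of train.py is unchanged):

from pytorch_lightning import Trainer

trainer = Trainer(# ...... other arguments as in train.py
                  gpus=hparams.num_gpus,
                  # keep the repo's custom sampler instead of DistributedSampler
                  replace_sampler_ddp=False,
                  # one process per GPU; fall back to single-process when num_gpus == 1
                  distributed_backend='ddp' if hparams.num_gpus > 1 else None,
                  # synchronize BatchNorm statistics across processes
                  sync_batchnorm=True)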

sleeplessai avatar Dec 04 '21 18:12 sleeplessai