MVSNet_pl
The correct way to enable multi-GPU training
Hi, @kwea123
I am running some experiments with this MVSNet implementation because of its clean and simple PyTorch Lightning wrapping. To speed up training, I tried training the model with 3 GPUs on my server, but an error appeared when I simply set the hyperparameter --num_gpus to 3. PyTorch Lightning raised the following warning:
"You seem to have configured a sampler in your DataLoader. This will be replaced "
" by `DistributedSampler` since `replace_sampler_ddp` is True and you are using"
" distributed training. Either remove the sampler from your DataLoader or set"
" `replace_sampler_ddp=False` if you want to use your custom sampler."
To solve this problem, I modified train.py by setting these parameters in the PL Trainer:
trainer = Trainer(
    # ......
    gpus=hparams.num_gpus,
    replace_sampler_ddp=False,
    distributed_backend='ddp' if hparams.num_gpus > 1 else None,
    # ......
)
The model trains successfully after this configuration.
Is this the correct way to enable multi-GPU training? For some reason, I cannot install NVIDIA Apex on my current server. Should I use SyncBatchNorm for this model implementation, and if so, how? Does skipping SyncBN hurt performance? Please tell me whether I should use nn.SyncBatchNorm.convert_sync_batchnorm() or PyTorch Lightning's sync_batchnorm option in the Trainer configuration.
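For reference, here is a minimal sketch of the two options I am asking about. The small nn.Sequential is only a stand-in for the MVSNet backbone, and the Trainer flags assume PL 0.9.0 or later; this is just to illustrate the question, not the repo's actual code.

import torch.nn as nn
from pytorch_lightning import Trainer

# Option A: manual conversion -- replace every nn.BatchNormXd layer in an
# already-built network with nn.SyncBatchNorm. `net` is a placeholder for
# the real MVSNet model.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
net = nn.SyncBatchNorm.convert_sync_batchnorm(net)

# Option B: let PyTorch Lightning convert the model automatically via the
# Trainer flag, together with the DDP backend.
trainer = Trainer(
    gpus=3,
    distributed_backend='ddp',
    sync_batchnorm=True,
)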
Thanks a lot. 😊
Hello, have you solved the problem, @sleeplessai?
Hi, @geovsion. Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0. Since the author didn't reply quickly, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it in the future. The code has been tested on a 3-GPU cluster node and works well.
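Roughly, the Trainer setup I ended up with looks like the sketch below (PL 0.9.0, PyTorch 1.6.0). It assumes the hparams names from this repo's train.py; other Trainer arguments are omitted, so see the fork for the exact code.

from pytorch_lightning import Trainer

trainer = Trainer(
    # ......
    gpus=hparams.num_gpus,
    distributed_backend='ddp' if hparams.num_gpus > 1 else None,
    replace_sampler_ddp=False,              # keep the repo's own sampler
    sync_batchnorm=hparams.num_gpus > 1,    # sync BatchNorm stats across GPUs
    # ......
)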