Support multi-GPU training and prediction
🚀 Feature
Support multi-GPU training and prediction.
Motivation
Distributing the workload across multiple GPUs can considerably speed up both training and prediction.
Pitch
PyTorch provides at least two ways of utilizing multiple GPUs:

- `DataParallel`
- `DistributedDataParallel`
`DataParallel` is trivial to implement but is not very efficient. `DistributedDataParallel` is more efficient and is the recommended approach, but is more complicated to implement. I think Raster Vision should support both.
DataParallel
Using `DataParallel` would be as simple as wrapping the model at the end of `Learner.setup_model()`:

```python
self.model = DataParallel(self.model)
```
Plus small changes to ensure that saving and loading weights work correctly, as sketched below.
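A minimal sketch of what those changes might look like, assuming checkpoints should stay loadable by an unwrapped model (`save_weights` and `load_weights` are hypothetical helpers, not existing Raster Vision methods):

```python
import torch
from torch.nn import DataParallel

def save_weights(model, path):
    # DataParallel stores parameters under a `module.` prefix; saving the
    # inner module keeps the checkpoint loadable without the wrapper.
    module = model.module if isinstance(model, DataParallel) else model
    torch.save(module.state_dict(), path)

def load_weights(model, path, device='cpu'):
    state_dict = torch.load(path, map_location=device)
    module = model.module if isinstance(model, DataParallel) else model
    module.load_state_dict(state_dict)
```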
DistributedDataParallel
Use of `DistributedDataParallel` involves spawning multiple processes and initializing the model, loss function, and optimizer within each process.
This could be done by moving `Learner.setup_training()` inside the function that each process runs after the fork. Doing this for prediction might be trickier and might require modifying the learner backend.
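For illustration, here is a minimal, self-contained sketch of that pattern in stock PyTorch, not Raster Vision code; the toy model/dataset, the `nccl` backend, and the port choice are all assumptions:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train_worker(rank, world_size):
    # Each spawned process joins the process group under its own rank.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Model, loss, and optimizer are built inside the process, which is
    # what moving Learner.setup_training() after the fork would achieve.
    model = DDP(torch.nn.Linear(16, 2).to(rank), device_ids=[rank])
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Toy dataset; DistributedSampler shards it across ranks so each
    # process sees a disjoint subset.
    dataset = TensorDataset(torch.randn(256, 16),
                            torch.randint(0, 2, (256,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # vary shuffling across epochs
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(rank)), y.to(rank))
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # Spawn one process per GPU; the rank is passed as the first argument.
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```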
Saving/loading weights and logging to TensorBoard will also be tricky.
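One common way to handle both, assuming all ranks share a filesystem, is to write checkpoints and TensorBoard events from rank 0 only, with a barrier before the other ranks read the file. A sketch under those assumptions (`make_writer` and `save_and_sync_checkpoint` are hypothetical names):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.tensorboard import SummaryWriter

def make_writer(rank):
    # One TensorBoard writer on rank 0; other ranks log nothing, so the
    # event files are never written to concurrently.
    return SummaryWriter(log_dir='runs/ddp') if rank == 0 else None

def save_and_sync_checkpoint(model, rank, path='model.pth'):
    if rank == 0:
        # model.module is the unwrapped model inside DDP; saving it keeps
        # the checkpoint free of the `module.` prefix.
        torch.save(model.module.state_dict(), path)
    dist.barrier()  # ensure the file exists before other ranks read it
    state = torch.load(path, map_location=f'cuda:{rank}')
    model.module.load_state_dict(state)
```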
Alternatives
N/A
Additional context
Resources:
- https://pytorch.org/tutorials/beginner/dist_overview.html
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
- https://pytorch.org/docs/stable/notes/multiprocessing.html
- https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51 (published August 2021, so it may be outdated)