
Support multi-GPU training and prediction


🚀 Feature

Support multi-GPU training and prediction.

Motivation

Distributing the workload across multiple GPUs can considerably speed up training and prediction.

Pitch

PyTorch provides at least two ways of utilizing multiple GPUs:

  • DataParallel
  • DistributedDataParallel

DataParallel is trivial to implement but is not very efficient. DistributedDataParallel is more efficient and is the recommended approach, but it is more complicated to implement. I think Raster Vision should support both.

DataParallel

Using DataParallel would be as simple as wrapping the model at the end of Learner.setup_model():

self.model = DataParallel(self.model)

Plus small changes to ensure that saving/loading weights still works correctly (a sketch follows below).
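
A minimal sketch of what this might look like, assuming the existing Learner.setup_model() hook and a hypothetical save_model_weights() helper (the actual name of the saving method may differ):

import torch
from torch.nn import DataParallel

def setup_model(self):
    # ... existing model construction ...
    if torch.cuda.device_count() > 1:
        # replicate the model across all visible GPUs for data-parallel batches
        self.model = DataParallel(self.model)

def save_model_weights(self, path: str):
    # unwrap DataParallel before saving so the state_dict keys don't carry the
    # 'module.' prefix and stay loadable by a plain, unwrapped model
    model = (self.model.module
             if isinstance(self.model, DataParallel) else self.model)
    torch.save(model.state_dict(), path)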

DistributedDataParallel

Use of DistributedDataParallel involves spawning multiple processes and initializing the model, loss function, and optimizer within each process.

This could be done by moving Learner.setup_training() inside the function that each process runs after the fork. Doing this for prediction might be trickier and might require modifying the learner backend.

Saving/loading weights and logging to TensorBoard will also be tricky.
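
A rough sketch (not based on Raster Vision's code) of what a DistributedDataParallel training setup might look like, assuming one process per GPU spawned via torch.multiprocessing; build_model() is a placeholder for whatever Learner.setup_training() would do inside each process, and the master port is arbitrary:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank: int, world_size: int):
    # rendezvous info for the default env:// init method
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # model, loss, and optimizer must be created inside each process
    model = build_model().to(rank)  # build_model() is a placeholder
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    # ... training loop; the DataLoader would use a DistributedSampler ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)

In a setup like this, checkpoint saving and TensorBoard logging would typically be restricted to rank 0 so that the processes do not clobber each other's output.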

Alternatives

N/A

Additional context

Resources:

  • https://pytorch.org/tutorials/beginner/dist_overview.html
  • https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
  • https://pytorch.org/docs/stable/notes/multiprocessing.html
  • https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51 (published Aug 2021 -- might be outdated)

AdeelH · Mar 03 '22