
Support multi-GPU training and prediction


🚀 Feature

Support multi-GPU training and prediction.

Motivation

Distributing the workload across multiple GPUs can considerably speed up training and prediction.

Pitch

PyTorch provides at least two ways of utilizing multiple GPUs:

  • DataParallel
  • DistributedDataParallel

DataParallel is trivial to implement but is not very efficient. DistributedDataParallel is more efficient and is the recommended approach, but it is more complicated to implement. I think Raster Vision should support both.

DataParallel

Using DataParallel would be as simple as wrapping the model at the end of Learner.setup_model():

self.model = DataParallel(self.model)

Plus small changes to ensure that saving/loading weights still works correctly (a sketch follows below).
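
A minimal sketch of what this might look like, assuming the existing Learner.setup_model() hook and a hypothetical save_model_weights() helper (the actual name of the saving method may differ):

import torch
from torch.nn import DataParallel

def setup_model(self):
    # ... existing model construction ...
    if torch.cuda.device_count() > 1:
        # replicate the model across all visible GPUs for data-parallel batches
        self.model = DataParallel(self.model)

def save_model_weights(self, path: str):
    # unwrap DataParallel before saving so the state_dict keys don't carry the
    # 'module.' prefix and stay loadable by a plain, unwrapped model
    model = (self.model.module
             if isinstance(self.model, DataParallel) else self.model)
    torch.save(model.state_dict(), path)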

DistributedDataParallel

Use of DistributedDataParallel involves spawning multiple processes and initializing the model, loss function, and optimizer within each process.

This could be done by moving Learner.setup_training() inside the function that each process runs after the fork. Doing this for prediction might be trickier and might require modifying the learner backend.

Saving/loading weights and logging to TensorBoard will also be tricky.
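
A rough sketch (not based on Raster Vision's code) of what a DistributedDataParallel training setup might look like, assuming one process per GPU spawned via torch.multiprocessing; build_model() is a placeholder for whatever Learner.setup_training() would do inside each process, and the master port is arbitrary:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank: int, world_size: int):
    # rendezvous info for the default env:// init method
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # model, loss, and optimizer must be created inside each process
    model = build_model().to(rank)  # build_model() is a placeholder
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    # ... training loop; the DataLoader would use a DistributedSampler ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)

In a setup like this, checkpoint saving and TensorBoard logging would typically be restricted to rank 0 so that the processes do not clobber each other's output.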

Alternatives

N/A

Additional context

Resources:

  • https://pytorch.org/tutorials/beginner/dist_overview.html
  • https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
  • https://pytorch.org/docs/stable/notes/multiprocessing.html
  • https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51 (published Aug 2021 -- might be outdated)

AdeelH · Mar 03 '22