
Multiple GPUs using DataParallel

Open nerkulec opened this issue 2 years ago • 5 comments

This PR allows multiple GPUs to be used for training MALA models. It is done with DataParallel, which, in contrast to DistributedDataParallel, does not require multiprocessing. Simply set

parameters.running.num_gpus = 4

No additional changes to python or slurm scripts are needed.

nerkulec avatar Sep 14 '23 18:09 nerkulec

Thanks for the PR! I really like this and think we should implement it before we move to DDP (which is the next thing on my todo list), so I just had a look at this PR. I have two suggestions/adjustments:

  1. I would get rid of the example, since it only showcases one changed parameter; we can simply update the documentation instead.
  2. I think the functionality of parameters.running.num_gpus = 4 can simply be absorbed into parameters.use_gpu: that value could be an int instead of a bool, without any drawbacks.

I have made these changes in a PR here: https://github.com/nerkulec/mala/pull/1, if that looks OK we could first merge that PR and then this one here.
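A minimal sketch of how a single use_gpu attribute could absorb the GPU count, since Python's bool is a subclass of int (True coerces to 1). Class and attribute names here are illustrative, not MALA's actual API:

```python
# Hypothetical sketch: parameters.use_gpu accepts a bool or an int, and
# the GPU count falls out of a simple int() coercion, so no separate
# num_gpus parameter is needed. Names are illustrative, not MALA's API.

class RunningParameters:
    def __init__(self):
        self.use_gpu = False  # bool or int; 0/False disables GPU use

    @property
    def num_gpus(self):
        # True behaves like "one GPU"; any positive int is a GPU count.
        return int(self.use_gpu)

params = RunningParameters()
params.use_gpu = 4           # request four GPUs via DataParallel
print(params.num_gpus)       # → 4
params.use_gpu = True        # plain single-GPU training still works
print(params.num_gpus)       # → 1
```

Existing scripts that set use_gpu = True keep working unchanged, which is the "without any drawbacks" part of the suggestion.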

RandomDefaultUser avatar Apr 25 '24 07:04 RandomDefaultUser

Oh wait, there is one potential problem: training with multiple GPUs and then loading the model to run with either only one GPU or MPI+GPU. I will test this right away!
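One common source of exactly this kind of train/load mismatch in PyTorch (noted here as a general possibility, not as MALA's confirmed problem) is that DataParallel wraps the model, so every key in its saved state_dict gains a "module." prefix, which a plain single-GPU model will not recognize. A generic fix is to strip the prefix on load:

```python
# A model trained inside torch.nn.DataParallel saves its state_dict with
# every key prefixed by "module.", so loading it into an unwrapped
# (single-GPU or CPU) model fails with missing/unexpected keys. A common
# remedy is to strip the prefix before loading. This is a generic
# PyTorch pattern, not necessarily what MALA ended up doing.

def strip_dataparallel_prefix(state_dict):
    """Return a copy of `state_dict` with any leading 'module.' removed."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Example with a plain dict standing in for a real state_dict:
saved = {"module.layer1.weight": 0.5, "module.layer1.bias": 0.1}
print(strip_dataparallel_prefix(saved))
# → {'layer1.weight': 0.5, 'layer1.bias': 0.1}
```

Keys that never had the prefix (e.g. from a single-GPU run) pass through untouched, so the same loading path works for both kinds of checkpoints.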

RandomDefaultUser avatar Apr 25 '24 07:04 RandomDefaultUser

I like your changes :) I added a small fix to my part. Now the checks make much more sense. Feel free to merge when you resolve the potential problem you mentioned.

nerkulec avatar Apr 25 '24 09:04 nerkulec

I confirmed that the inference pipeline indeed still works!

RandomDefaultUser avatar Apr 25 '24 12:04 RandomDefaultUser

In theory this works; once it has been benchmarked we can merge it!

RandomDefaultUser avatar Apr 25 '24 14:04 RandomDefaultUser

The benchmarks showed that multi-GPU training with DataParallel was slower than training on a single GPU. I'm closing this since we now have the DDP implementation.

nerkulec avatar Aug 21 '24 13:08 nerkulec