openmm-torch

Multi-GPU support?

Open jchodera opened this issue 1 year ago • 7 comments

How can we best support parallelization of ML potentials across GPUs?

We're dealing with models that are small enough to be replicated on each GPU, and only O(N) data (positions, box vectors) needs to be sent and O(N) data (forces) accumulated. Models like ANI should be trivially parallelizable across atoms.

jchodera avatar Oct 14 '23 18:10 jchodera

OpenMM's infrastructure for parallel execution can in principle be applied to any Force. Internally it creates a separate ComputeContext for each device, and a separate copy of the KernelImpl for each one. All of them get executed in parallel, and any energies and forces they return are summed.
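
For reference, this built-in parallelization is enabled from Python by listing several device indices in the platform properties. A minimal sketch, assuming the CUDA platform, two visible GPUs, and a placeholder `input.pdb`:

```python
# Minimal sketch: run one Simulation across two GPUs using OpenMM's
# built-in parallelization (assumes the CUDA platform and devices 0 and 1;
# 'input.pdb' is a placeholder input file).
import openmm
from openmm import app, unit

pdb = app.PDBFile('input.pdb')
ff = app.ForceField('amber14-all.xml', 'amber14/tip3p.xml')
system = ff.createSystem(pdb.topology, nonbondedMethod=app.PME)

integrator = openmm.LangevinMiddleIntegrator(300*unit.kelvin,
                                             1/unit.picosecond,
                                             0.002*unit.picoseconds)
platform = openmm.Platform.getPlatformByName('CUDA')
# Comma-separated device indices ask OpenMM to create one ComputeContext
# per device and sum the energies/forces they return.
properties = {'DeviceIndex': '0,1'}

simulation = app.Simulation(pdb.topology, system, integrator, platform, properties)
simulation.context.setPositions(pdb.positions)
simulation.step(1000)
```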

The challenge is figuring out what each of those KernelImpls should do when it gets invoked. For many Forces this is simple: with most bonded forces we can just divide up the bonds between GPUs, with each one computing a different subset. NonbondedForce is a bit more complicated, but we have ways of doing it.

What would TorchForce do? It doesn't know anything about the internal structure of the model. It just gets invoked once, taking all coordinates as inputs and producing the total energy as output. So the division of work would have to be done inside the model itself. We could pass in a pair of integers telling it how many devices it was executing on, and the index of the current device. The model would have to decide what to do with those inputs such that each device would do a similar amount of work, and the total energy would add up to the correct amount.
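
To make the idea concrete, here is a rough sketch of a model written against that hypothetical interface. The `num_devices`/`device_index` inputs do not exist in TorchForce today, and the per-atom energy decomposition is only an assumption about the model:

```python
# Conceptual sketch only: a model that partitions atoms by device index so that
# each copy of the kernel computes a disjoint slice of the total energy.
# The num_devices/device_index arguments are a hypothetical extension of the
# TorchForce interface, not an existing feature.
import torch


class PartitionedModel(torch.nn.Module):
    def __init__(self, per_atom_energy: torch.nn.Module):
        super().__init__()
        self.per_atom_energy = per_atom_energy  # maps (n, 3) positions -> (n,) energies

    def forward(self, positions, num_devices: int, device_index: int):
        n = positions.shape[0]
        # Give each device a contiguous slice of atoms of roughly equal size.
        # (A real model would also need the positions of neighbor atoms
        # outside its slice.)
        start = (n * device_index) // num_devices
        end = (n * (device_index + 1)) // num_devices
        local_energy = self.per_atom_energy(positions[start:end]).sum()
        # Summing the slices over all devices reproduces the full energy,
        # which is what OpenMM does with the values each kernel returns.
        return local_energy
```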

peastman avatar Oct 15 '23 04:10 peastman

Perhaps this would be something for NNPOps. We could provide drop-in implementations of selected models there that would be multi-GPU aware. This would need to be done on a model-by-model basis. I will leave these here for reference:
https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel
https://pytorch.org/docs/stable/multiprocessing.html
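
As an illustration of the first link, a toy sketch of what `torch.nn.DataParallel` does: it scatters the leading dimension of the input across the visible GPUs, runs a model replica on each, and gathers the outputs. For an atom-decomposable potential the atoms could in principle play the role of that batch dimension. The `PerAtomEnergy` module below is purely illustrative:

```python
# Toy sketch of torch.nn.DataParallel: the leading dimension of the input is
# split across GPUs, each replica processes its chunk, and the outputs are
# gathered back on the default device. PerAtomEnergy is a made-up stand-in
# for an atom-decomposable model such as ANI.
import torch


class PerAtomEnergy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1)
        )

    def forward(self, positions):            # (num_atoms, 3)
        return self.net(positions)           # (num_atoms, 1) per-atom energies


model = PerAtomEnergy()
if torch.cuda.device_count() > 1:
    # Replicate the model on every GPU; inputs are scattered along dim 0.
    model = torch.nn.DataParallel(model.cuda())
    positions = torch.randn(10000, 3, device='cuda')
    energy = model(positions).sum()          # total energy from all devices
```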

RaulPPelaez avatar Oct 15 '23 08:10 RaulPPelaez

Hi, I was wondering whether there is a way to run REMD (ReplicaExchangeSampler) with TorchForce on multiple GPUs?

xiaowei-xie2 avatar May 17 '24 20:05 xiaowei-xie2

It should work exactly like any other force. Replica exchange is implemented at a higher level, using multiple Contexts for the replicas. It doesn't care how the forces in each Context are computed.
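
To illustrate, a minimal sketch of that pattern: one Context per replica, each pinned to a different GPU through its platform properties. The forces inside each Context, TorchForce included, are unaffected. The file name, temperatures, and replica count are placeholders, and a real replica-exchange code (e.g. openmmtools' ReplicaExchangeSampler) adds the exchange logic on top:

```python
# Minimal sketch: one Context per replica, each pinned to its own GPU.
# The System may contain a TorchForce; the replica-exchange machinery
# manages Contexts like these at a higher level. File name, temperatures,
# and replica count are placeholders.
import openmm
from openmm import app, unit

pdb = app.PDBFile('input.pdb')
ff = app.ForceField('amber14-all.xml', 'amber14/tip3p.xml')
system = ff.createSystem(pdb.topology, nonbondedMethod=app.PME)

platform = openmm.Platform.getPlatformByName('CUDA')
temperatures = [t * unit.kelvin for t in (300, 310, 320, 330)]

contexts = []
for gpu, temperature in enumerate(temperatures):
    integrator = openmm.LangevinMiddleIntegrator(temperature,
                                                 1/unit.picosecond,
                                                 0.002*unit.picoseconds)
    # Pin this replica's Context to one GPU.
    context = openmm.Context(system, integrator, platform,
                             {'DeviceIndex': str(gpu)})
    context.setPositions(pdb.positions)
    contexts.append(context)
```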

peastman avatar May 17 '24 21:05 peastman

Oh nice! Could you provide a simple example of how to do this? I came across this issue https://github.com/choderalab/openmmtools/issues/648, but could not figure out how to do it exactly.

xiaowei-xie2 avatar May 17 '24 22:05 xiaowei-xie2

I suggest asking on the openmmtools repo. The question isn't related to this package.

peastman avatar May 17 '24 22:05 peastman

Ok, I will do that. Thank you!

xiaowei-xie2 avatar May 17 '24 22:05 xiaowei-xie2

Message-passing GNNs are still a difficult problem for multi-GPU MD: we need to exchange ghost node features between interaction layers.
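
For context, the communication being described looks roughly like the sketch below. It assumes a torch.distributed process group is already initialized, that every rank owns the same number of atoms, and that the partitioning bookkeeping (`ghost_src_rank`, `ghost_src_index`) was built when the graph was decomposed; those names are hypothetical, not part of any existing library:

```python
# Conceptual sketch of a halo ("ghost node") exchange between message-passing
# layers of a spatially partitioned GNN. Assumes torch.distributed is already
# initialized and every rank owns the same number of atoms. ghost_src_rank and
# ghost_src_index are hypothetical bookkeeping from the graph decomposition.
import torch
import torch.distributed as dist


def exchange_ghost_features(local_features, ghost_src_rank, ghost_src_index):
    """Return features of ghost nodes owned by other ranks.

    local_features : (n_local, d) features of the atoms this rank owns.
    ghost_src_rank : (n_ghost,) long tensor, rank that owns each ghost atom.
    ghost_src_index: (n_ghost,) long tensor, index of each ghost atom on its owner.
    """
    world_size = dist.get_world_size()
    # Gather every rank's local features (simple but communication-heavy;
    # a real implementation would exchange only the needed halo atoms).
    gathered = [torch.empty_like(local_features) for _ in range(world_size)]
    dist.all_gather(gathered, local_features)
    all_features = torch.stack(gathered)           # (world_size, n_local, d)
    return all_features[ghost_src_rank, ghost_src_index]
```

Between interaction layers, each rank would call something like this, concatenate the returned ghost features with its local ones, and run the next message-passing step on the enlarged subgraph.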

SyntaxSmith avatar Oct 08 '24 06:10 SyntaxSmith