
Multi-GPU Support for multistate runs

Open mpvenkatesh opened this issue 3 years ago • 5 comments

I am trying to locate the openmmtools functionality that corresponds to the 'DeviceIndex' property that can be passed to OpenMM's Simulation() to select the available GPU devices. I am able to set the platform to 'CUDA' via cache.ContextCache(), but my sample Parallel Tempering run still executes entirely on one GPU, even though 8 are available. I have essentially packaged the example at https://openmmtools.readthedocs.io/en/0.18.1/api/generated/openmmtools.multistate.ParallelTemperingSampler.html into remd.py and am calling it on the 8-GPU node with python remd.py. The intention is to have the replicas run on separate GPUs (running each replica on all available GPUs may also need to be supported in some cases, but that is not essential).
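For reference, this is the plain-OpenMM pattern I mean (a minimal sketch; the openmmtools test system and the device indices are only illustrative):

```python
# Plain OpenMM: GPUs are selected by passing the DeviceIndex platform property
# to Simulation(). A small openmmtools test system stands in for my real system.
from simtk import unit
from simtk.openmm import LangevinIntegrator, Platform
from simtk.openmm.app import Simulation
from openmmtools import testsystems

testsystem = testsystems.AlanineDipeptideVacuum()
integrator = LangevinIntegrator(300*unit.kelvin, 1.0/unit.picosecond, 2.0*unit.femtosecond)
platform = Platform.getPlatformByName('CUDA')
properties = {'DeviceIndex': '0,1'}  # example: use GPUs 0 and 1
simulation = Simulation(testsystem.topology, testsystem.system, integrator,
                        platform, properties)
```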

(Note that my set-up is fine, as I am able to perform an MD simulation using just OpenMM on 8 GPUs, with a similar python call around a sample script.)

Thanks! Venkatesh

Versions: openmm-7.5.1-py38h7850c2e_1; openmmtools-0.20.3-pyhd8ed1ab_0; NVIDIA V100 GPU (Driver Version 450.119.04, CUDA Version 11.0); Python 3.8.5

mpvenkatesh avatar Jul 08 '21 19:07 mpvenkatesh

Hi @mpvenkatesh . The multi-GPU support is based on MPI so you will have to execute your python driving script with mpirun (or similar).

If you are already doing this, it might be the case that OpenMM only sees one of the GPUs. To check this, you can use openmmtools.utils.get_available_platforms(). If it doesn't see all the GPUs, there might be something wrong with the environment variables (e.g., CUDA_VISIBLE_DEVICES) or the installation.
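A quick diagnostic along these lines (just a sketch) should show what OpenMM sees from inside the environment your MPI processes run in:

```python
# Diagnostic sketch: list the platforms OpenMM can see and the GPUs exposed
# through CUDA_VISIBLE_DEVICES from inside the Python environment.
import os
from openmmtools import utils

for platform in utils.get_available_platforms():
    print(platform.getName())
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
```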

Finally, for your first question, OpenMM sets the DeviceIndex through the Platform properties. See here. With the ContextCache, you can pass a dictionary of properties in the constructor when initializing the cache.
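Off the top of my head, something like this should get the property through (a rough sketch: here I set the default value on the Platform object itself rather than relying on the cache constructor, and DeviceIndex '0' is just an example):

```python
# Sketch: force a specific GPU by setting the default DeviceIndex on the
# Platform object and handing that Platform to the ContextCache.
from simtk.openmm import Platform
from openmmtools import cache

platform = Platform.getPlatformByName('CUDA')
platform.setPropertyDefaultValue('DeviceIndex', '0')  # example device
context_cache = cache.ContextCache(platform=platform)
```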

andrrizzi avatar Jul 16 '21 07:07 andrrizzi

Andrea,

I am using Open MPI 3.1.5. I am running in interactive mode on a node with 8 GPUs and 1 CPU, requested like so: srun -G 8 -n 1 --pty bash -i. I then launch this way: mpirun -np 1 python remd.py. Run this way, a single CPU process is started and it uses only 1 GPU, even though there are 8. If I repeat the exercise with srun -n 1 and mpirun -np 8, 8 processes are launched, each independently running remd.py, and all of them still use only the first GPU.

When I import cuda from numba and interrogate the devices, all 8 GPUs are visible; CUDA_VISIBLE_DEVICES also shows all 8. However, when I call openmmtools.utils.get_available_platforms(), I get [<simtk.openmm.openmm.Platform; proxy of <Swig Object of type 'OpenMM::Platform *' at 0x7f3350c4f300> >, <simtk.openmm.openmm.Platform; proxy of <Swig Object of type 'OpenMM::Platform *' at 0x7f3350c4f0f0> >, <simtk.openmm.openmm.Platform; proxy of <Swig Object of type 'OpenMM::Platform *' at 0x7f3350c4f240> >]. openmm.Platform.getPlatformByName('CUDA') gives me <simtk.openmm.openmm.Platform; proxy of <Swig Object of type 'OpenMM::Platform *' at 0x7f6682d90450> >. Using that platform to set context_cache = cache.ContextCache(platform=platform) does not change the outcome.

Perhaps it would help if I could pass the DeviceIndex to cache.ContextCache(); however, the keyword arguments are expected to be cache parameters (self._lru = LRUCache(**kwargs)), see https://openmmtools.readthedocs.io/en/0.18.1/_modules/openmmtools/cache.html#ContextCache.
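For example (a sketch of the point above):

```python
# Extra keyword arguments to ContextCache() are forwarded to the internal
# LRUCache (e.g., capacity, time_to_live), not to the OpenMM Platform, so
# there is no obvious slot for DeviceIndex here.
from openmmtools import cache

context_cache = cache.ContextCache(capacity=10, time_to_live=50)  # cache parameters: fine
# cache.ContextCache(DeviceIndex='0')  # not a cache parameter, so this fails
```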

What am I missing?

mpvenkatesh avatar Jul 20 '21 20:07 mpvenkatesh

@mpvenkatesh , I think you might have to set the CUDA_VISIBLE_DEVICES variable correctly for each MPI process. We usually run MPI using a hostfile and a configfile that sets something like mpirun -env CUDA_VISIBLE_DEVICES X for each MPI process.

We use an internal python script to do this (also conda-installable): https://github.com/choderalab/clusterutils. It might help, although it doesn't support all systems and we use it mostly internally.
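For what it's worth, the idea behind that configfile is roughly this (a rough sketch assuming mpi4py and one rank per GPU; this is not an official openmmtools recipe):

```python
# Sketch of per-rank GPU pinning: each MPI rank restricts itself to a single
# GPU by setting CUDA_VISIBLE_DEVICES before any CUDA context is created.
# Assumes mpi4py is available and that ranks map 1:1 onto the node's GPUs.
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
n_gpus_per_node = 8  # adjust for your node
os.environ['CUDA_VISIBLE_DEVICES'] = str(rank % n_gpus_per_node)

# Import openmmtools and build the sampler only after this point, so that the
# CUDA platform initializes with the restricted device list.
```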

andrrizzi avatar Jul 21 '21 07:07 andrrizzi

I am able to get remd.py to run on a specific GPU by running mpirun -x CUDA_VISIBLE_DEVICES=X -n 1 python remd.py. When X is a list, only the first GPU in the list gets used. Setting -n to a value greater than 1 also results in only the first GPU being used; further, this launches multiple remd.py calls in parallel, which is not the intention. The desired behavior is to have 1 CPU-based process running python remd.py that sends replicas to each of the available GPUs on the node, so -n to mpirun must be 1. Please confirm that this is the intended way to run. (build_mpirun_configfile python remd.py was not helpful.)

mpvenkatesh avatar Jul 22 '21 00:07 mpvenkatesh

@mpvenkatesh, I see. That is not the setting ContextCache was optimized for. The job of ContextCache is to decide whether to re-use a previously created GPU Context or to create a new one. Currently, it always re-uses a previous Context when it can (i.e., when the two thermodynamic states differ only by thermodynamic parameters that OpenMM allows to be modified after context creation). For our replica exchange application, and when using multiple MPI processes (not 1), this results in a near-optimal policy.
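To make the policy concrete, something like this is what I mean (a rough sketch; the try/except import just covers OpenMM versions before and after the simtk rename, and whether the identity check prints True depends on the compatibility rules described above):

```python
# Sketch of the reuse policy: two thermodynamic states that differ only in
# temperature are compatible, so the cache hands back the same Context and
# simply updates its parameters.
try:
    from openmm import unit      # OpenMM >= 7.6
except ImportError:
    from simtk import unit       # OpenMM < 7.6, as in this thread
from openmmtools import cache, states, testsystems

system = testsystems.AlanineDipeptideVacuum().system
state_300 = states.ThermodynamicState(system, temperature=300*unit.kelvin)
state_310 = states.ThermodynamicState(system, temperature=310*unit.kelvin)

context_cache = cache.ContextCache()
context_a, _ = context_cache.get_context(state_300)
context_b, _ = context_cache.get_context(state_310)
print(context_a is context_b)  # expected True: one Context serves both states
```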

You might be able to extend ContextCache and implement a different policy, but the calls to OpenMM are blocking, so I think the execution would be serial anyway. Also, openmmtools was not built with thread support in mind, so I'm not sure there are easy workarounds for this.

andrrizzi avatar Jul 22 '21 08:07 andrrizzi