openmmtools
[Question] How to parallelize replicas in REMD simulations?
Hi,
I would like to run REMD simulations using the ReplicaExchangeSampler from this library. I have quite a few replicas that I would like to simulate in parallel to reduce the overall runtime. I see that the current implementation uses mpiplus to distribute the workload over MPI workers, and that works fine with the standard MPI script call over CPU workers.
Is it possible to assign different GPUs to different replicas using the current infrastructure? The best I could achieve is several MPI workers that simulate different replicas on the same GPU.
If it's not possible at the moment, how can I set a different DeviceIndex on the platforms associated with each thermodynamic_state?
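For concreteness, this is the kind of per-rank device binding I'm after; a minimal sketch, assuming one MPI rank per replica, mpi4py available, and that handing the platform to openmmtools' global context cache is the right hook (that last part is my guess):

import os
from mpi4py import MPI
import openmm
from openmmtools import cache

# One MPI rank per replica; map each rank onto one of the visible GPUs.
rank = MPI.COMM_WORLD.Get_rank()
n_gpus = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(","))

# Pin this rank's OpenMM contexts to a single CUDA device.
platform = openmm.Platform.getPlatformByName("CUDA")
platform.setPropertyDefaultValue("DeviceIndex", str(rank % n_gpus))

# Guess: give the pinned platform to the context cache that openmmtools
# uses when it builds the replica contexts.
cache.global_context_cache.platform = platform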
Based on issue 516 and some of the replex discussion and scripts from @zhang-ivy, it's clear you have to run it with a hostfile and a configfile.
You can generate the files with clusterutils' build_mpirun_configfile.py.
The file contents are very simple and can be reused: just change the hostfile contents to whatever node you're running your job on, and you can skip build_mpirun_configfile for subsequent runs (see the sketch after the file contents below).
It appears to be running successfully for me, unless I actually have 4 separate instances of replica exchange writing to the same file.
When I was playing around with it in an interactive session, I had to change the "srun" call in build_mpirun_configfile.py (line 215) to "mpirun" and manually set two environment variables to get it to run:
export SLURM_JOB_NODELIST=$HOSTNAME
export CUDA_VISIBLE_DEVICES=0,1,2,3
(or whatever the available device ids are)
The hostfile just contains the node name repeated for the number of GPUs you'll use (assuming you're on a single node):
exp-17-59
exp-17-59
exp-17-59
exp-17-59
and the configfile contains:
-np 1 -env CUDA_VISIBLE_DEVICES 0 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 1 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 2 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 3 python hremd_start.py
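If you want to regenerate these two files for a new node without going through build_mpirun_configfile again, a throwaway script along these lines should do it (just a sketch; the GPU count, filenames, and command are placeholders for whatever your job uses):

import socket

n_gpus = 4                       # one MPI rank per GPU on this node
command = "python hremd_start.py"
node = socket.gethostname()      # or hard-code the node name from your job

# hostfile: the node name repeated once per GPU
with open("hostfile", "w") as f:
    f.write((node + "\n") * n_gpus)

# configfile: one rank per GPU, each pinned to its own device
entries = [
    "-np 1 -env CUDA_VISIBLE_DEVICES {} {}".format(i, command)
    for i in range(n_gpus)
]
with open("configfile", "w") as f:
    f.write(" :\n".join(entries) + "\n")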
Running it just looks like this:
build_mpirun_configfile.py --configfilepath configfile --hostfilepath hostfile "python hremd_start.py"
mpiexec.hydra -f hostfile -configfile configfile
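hremd_start.py itself isn't shown here; for orientation, a bare-bones version of the kind of script being launched might look roughly like this (a sketch patterned on the openmmtools documentation example; the test system, temperature ladder, move settings, and output name are illustrative, not from this thread):

from openmm import unit   # OpenMM >= 7.6; older versions use simtk.unit
from openmmtools import mcmc, states, testsystems
from openmmtools.multistate import MultiStateReporter, ReplicaExchangeSampler

# Illustrative system and temperature ladder (4 replicas).
testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 4
temperatures = [(300.0 + 25.0 * i) * unit.kelvin for i in range(n_replicas)]

thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=T)
    for T in temperatures
]

# One MC move applied to every replica between exchange attempts.
move = mcmc.GHMCMove(timestep=2.0 * unit.femtoseconds, n_steps=50)
sampler = ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=100)

# All replicas write to a single netCDF file through the reporter.
reporter = MultiStateReporter("hremd.nc", checkpoint_interval=10)
sampler.create(
    thermodynamic_states=thermodynamic_states,
    sampler_states=states.SamplerState(testsystem.positions),
    storage=reporter,
)
sampler.run()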
I'll follow up if it turns out I was dreaming.
Of course the Slurm output makes it look like 4 independent processes started, since several print statements are repeated 4 times. Those are just simulation setup steps, though, and even if they are executed several times, ReplicaExchangeSampler's run() is probably running as a single instance, because if I do
lsof -p xxxxx | grep nc
for each GPU process ID, only one of the GPU processes is accessing and writing to the log files.
Seems like a good sign.
Hi @felixmusil and @Dan-Burns! I am trying to set up HREX MD with MPI, but the tutorial only shows how to run all replicas on a single core. I would like to run each replica independently on several cores, but with all replicas on the same GPU (we have 1 GPU per node). Do you know how to set up such a run? When I use mpiexec.hydra -n 8 python HREX.py, I get 8 copies of the HREX system (each consisting of 8 thermodynamic_states), and consequently no speedup.