perses
Speed up protein mutation charge changing repex consistency test
This test is currently very slow, so we need to speed it up.
See: https://github.com/choderalab/perses/issues/1044#issuecomment-1165575916
@jchodera @ijpulidos : Here's the log from the last 2 iterations of running ALA->ARG (charge changing):
*******************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 4999/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2adef5fe7dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took 0.003s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 7092/93312 attempted swaps (7.6%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took 6.890s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took 3.667s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2adef5e56ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2adef5e56f70>
DEBUG:openmmtools.multistate.multistatereporter:Iteration 4999 not on the Checkpoint Interval of 10. Sampler State not written.
DEBUG:openmmtools.utils:Storing sampler states took 0.000s
DEBUG:openmmtools.utils:Writing iteration information to storage took 0.188s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 10.748s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:10.473674, at 2022-Jun-25-01:52:35 (consuming total wall clock time 14:32:48.370056).
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 5000/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2adef5fe7dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took 0.003s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 8050/93312 attempted swaps (8.6%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took 6.910s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took 3.721s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2adef5e56ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2adef5e56f70>
DEBUG:openmmtools.utils:Storing sampler states took 0.008s
DEBUG:openmmtools.utils:Writing iteration information to storage took 0.215s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 10.851s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:00, at 2022-Jun-25-01:52:35 (consuming total wall clock time 14:32:48.749468).
vs the last 2 iterations of THR->ALA (neutral):
DEBUG:openmmtools.multistate.multistatesampler:Iteration 4999/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2b7a99d82dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took 0.000s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 684/3456 attempted swaps (19.8%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took 1.919s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took 0.554s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2b7a99bf3ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2b7a99bf3f70>
DEBUG:openmmtools.multistate.multistatereporter:Iteration 4999 not on the Checkpoint Interval of 10. Sampler State not written.
DEBUG:openmmtools.utils:Storing sampler states took 0.000s
DEBUG:openmmtools.utils:Writing iteration information to storage took 0.020s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 2.495s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:02.436523, at 2022-Jun-24-03:43:01 (consuming total wall clock time 3:23:02.616504).
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 5000/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2b7a99d82dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took 0.000s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 796/3456 attempted swaps (23.0%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took 1.943s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took 0.556s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2b7a99bf3ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2b7a99bf3f70>
DEBUG:openmmtools.utils:Storing sampler states took 0.005s
DEBUG:openmmtools.utils:Writing iteration information to storage took 0.053s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 2.554s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:00, at 2022-Jun-24-03:43:01 (consuming total wall clock time 3:23:02.735137).
I just analyzed the dipeptide charge changing test data: https://github.com/choderalab/perses/issues/1044#issuecomment-1167913833
@ijpulidos : This should speed the test up a lot:
move = mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picosecond,
                                 n_steps=50,
                                 reassign_velocities=False,
                                 n_restart_attempts=20)
Note that we want to wait until openmmtools 0.21.5 is released so that we can use the updated LangevinDynamicsMove with the LangevinMiddleIntegrator.
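For reference, the "middle" Langevin scheme the updated move is meant to use is already available directly in OpenMM (7.5+) as LangevinMiddleIntegrator; a minimal construction sketch (the 300 K temperature is an assumption, the timestep and collision rate mirror the move above):

import openmm
from openmm import unit

# Minimal sketch only: the integrator family the updated LangevinDynamicsMove targets.
integrator = openmm.LangevinMiddleIntegrator(300.0 * unit.kelvin,       # assumed temperature
                                             1.0 / unit.picosecond,     # collision rate, as above
                                             4.0 * unit.femtoseconds)   # timestep, as above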
By using openmm-nightly and LangevinDynamicsMove, we are getting almost a 4x speedup for the neutral test.
This means that for the neutral mutation test, for our target of 3000 iterations, we are talking about ~25 minutes on an A4000 GPU (0.25 s per iteration, 3000 iterations, and 2 transformations -- forward and backward). Please note that this GPU is significantly faster than the GPUs in our current CI workflow.
For the charge mutation (explicit solvent) the speedup is similar (considering there are more replicas/states). That would mean the tests would take about ~1.5 hours for our target of 1000 iterations on an A4000 GPU (1.2 s per iteration, 1000 iterations, and 4 transformations, as the test is set up).
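For reference, the arithmetic behind those two estimates (a quick sanity check only):

# Wall-clock estimates quoted above, recomputed from the per-iteration timings:
neutral_minutes = 0.25 * 3000 * 2 / 60    # s/iteration * iterations * transformations -> 25.0 min
charge_hours    = 1.2  * 1000 * 4 / 3600  # s/iteration * iterations * transformations -> ~1.3 h
print(f"neutral: ~{neutral_minutes:.0f} min, charge: ~{charge_hours:.1f} h")
# The quoted ~1.5 h leaves some headroom for per-iteration overhead beyond the 1.2 s average.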
Great! Also, I just checked and the current tests use 3000 iterations -- why are you using 1000 iterations? Is the test still passing (i.e. assert DDG < 6 * dDDG, f"DDG ({DDG}) is greater than 6 * dDDG ({6 * dDDG})" doesn't raise an error), even with 1000 iterations?
> Great! Also, I just checked and the current tests use 3000 iterations -- why are you using 1000 iterations? Is the test still passing (i.e. assert DDG < 6 * dDDG, f"DDG ({DDG}) is greater than 6 * dDDG ({6 * dDDG})" doesn't raise an error), even with 1000 iterations?
I was thinking about 1000 iterations for the computations based on your comment here. The test passed, but I only ran it once, even though the numbers in that same thread seem to imply it might have issues with 1000 iterations. I guess we can bump it up to 2000 to be "safe" and it would still be just over 1 hour on the CI, which should be okay.
Ah that comment was referring to the charge changing mutations. For the neutral mutations, you'll have to see my comment here
But both of those comments are for experiments using 50 steps per iteration, instead of 250 steps (which is what we normally use in our production runs). Are you using 250 steps per iteration, or 50 steps?
> Ah that comment was referring to the charge changing mutations. For the neutral mutations, you'll have to see my comment here
I see, well I guess we can use 3000 iterations and it would be 1.5-2 hours on a 1080Ti (we have to check whether our GPU CI has similar performance, but I suspect it is currently slower, since it is using older Tesla K40s).
> But both of those comments are for experiments using 50 steps per iteration, instead of 250 steps (which is what we normally use in our production runs). Are you using 250 steps per iteration, or 50 steps?
50 steps, as the tests go.
hss = HybridRepexSampler(mcmc_moves=mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                                              collision_rate=1.0 / unit.picosecond,
                                                              n_steps=50,
                                                              reassign_velocities=False,
                                                              n_restart_attempts=20,
                                                              constraint_tolerance=1e-06),
                         replica_mixing_scheme='swap-all',
                         hybrid_factory=htf,
                         online_analysis_interval=None)
First of all, great news about the speedup from using openmm-dev and the new LangevinMiddleIntegrator.
I'm a bit surprised at why there is an 8x difference in iteration time between the charge and neutral tests. I think there are more thermodynamic states in the charge test, but any ideas what might lead to the 8x difference in iteration time?
> I see, well I guess we can use 3000 iterations and it would be 1.5-2 hours on a 1080Ti (we have to check whether our GPU CI has similar performance, but I suspect it is currently slower, since it is using older Tesla K40s).
Sounds good!
> I'm a bit surprised at why there is an 8x difference in iteration time between the charge and neutral tests. I think there are more thermodynamic states in the charge test, but any ideas what might lead to the 8x difference in iteration time?
The neutral test uses 12 replicas; the charge test uses 36 replicas. Additionally, the neutral test only requires two transformations (A->T, T->A), whereas the charge test requires 4. I think the 8x difference makes sense given the extra thermodynamic states and transformations.
Oh, I see the problem. It's an 8x difference per iteration (without considering the extra transformations)...
Interestingly, when comparing the neutral vs charge test iteration time with the LangevinSplittingDynamicsMove and openmm 7.7, charge is in fact 3x slower than neutral.
For LangevinDynamicsMove + openmm nightly, I wonder if the GPUs being used for neutral vs charge are different? @ijpulidos Any other ideas why the charge is more than 3x slower than the neutral?
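For the record, here is the expected ratio from the replica and transformation counts alone (a rough estimate; it ignores the difference in system size, e.g. explicit solvent for the charge test):

replicas_neutral, replicas_charge = 12, 36
transforms_neutral, transforms_charge = 2, 4

per_iteration_ratio = replicas_charge / replicas_neutral                              # 3.0
total_wallclock_ratio = per_iteration_ratio * transforms_charge / transforms_neutral  # 6.0
print(per_iteration_ratio, total_wallclock_ratio)

So replica count alone accounts for roughly a 3x per-iteration gap (about 6x in total wall clock once the extra transformations are counted), which is why the 8x per-iteration difference discussed above is surprising.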
@zhang-ivy Ah good catch, I just checked and I did use a wrong node in one of the tests (I misread the ls-gpu nodes on Lilac as 1080Ti, and these are only 1080). This only affects the first plot (comparison between neutral tests). I'll re-run this test on the correct node and update the plot once I get the results.
Comparing neutral vs charge, we have to consider that the tests currently use 12 replicas for neutral and 36 for charge (which also uses explicit solvent); that may help explain the differences in iteration times.
> Comparing neutral vs charge, we have to consider that the tests currently use 12 replicas for neutral and 36 for charge (which also uses explicit solvent); that may help explain the differences in iteration times.
This would be responsible for a 3x difference, but it's hard to imagine the 1080Ti vs 1080 being responsible for the remaining ~2.7x iteration time difference.
I've made a few comments on https://github.com/choderalab/perses/pull/1065 to suggest some ways we might be able to speed things up a bit.
> This would be responsible for a 3x difference, but it's hard to imagine the 1080Ti vs 1080 being responsible for the remaining ~2.7x iteration time difference.
Yeah, that doesn't explain the remaining difference; I just updated the results.
> I've made a few comments on #1065 to suggest some ways we might be able to speed things up a bit.
Will take a look at those and let you know once I get the chance to test them. Thanks for taking the time to look into this!
> I've made a few comments on https://github.com/choderalab/perses/pull/1065 to suggest some ways we might be able to speed things up a bit.
@jchodera : Thanks for the feedback! I just responded to all your comments.
> Yeah, that doesn't explain the remaining difference
I just double-checked my perses barnase:barstar paper data -- I ran A->T dipeptide (12 replicas) and R->A dipeptide (24 replicas), both in solvent, using 2 GPUs, and am seeing that A->T takes ~0.68 sec per iteration and R->A takes ~1.5 sec per iteration on lu nodes (GeForce RTX 2080). Note this is with LangevinDynamicsMove (using the middle integrator) and openmm 8.0beta (not the nightly dev).
@ijpulidos : Can you try re-running all the tests? And if you are still seeing an 8-fold difference, let's set up a quick chat to discuss what might be causing the difference as I am not seeing the same slowdown in my data.
> @ijpulidos : Can you try re-running all the tests? And if you are still seeing an 8-fold difference, let's set up a quick chat to discuss what might be causing the difference as I am not seeing the same slowdown in my data.
Please refer to my previous comment for the updated information: https://github.com/choderalab/perses/issues/1058#issuecomment-1217359114. I settled on using my local A4000 GPU due to some reproducibility issues when using the lilac 1080Ti nodes.
I re-ran the tests and updated the plots. Basically we were facing the problem that @zhang-ivy had already warned us about in the last part of https://github.com/choderalab/perses/pull/1054#issuecomment-1178224781
We have a slowdown problem when we generate the HTF and run the simulation in the same script; this was causing the extra overhead. To avoid this I had to save the HTFs to pickle files and then load them to run the simulation in a separate script. We don't know why this is happening, but we will try to debug it further in a separate issue.
Basically, we can now verify the consistency on the updated plots, where you get a ~4x iteration time difference when comparing the two tests (that is, comparing the same colored lines across the two plots). We need to refactor the tests to account for things discussed in this thread.
To address the slowdown problem, we don't actually want to run separate scripts for the test, so Iván and I discussed that the next short-term action item is to see if generating the HTF, saving it as a pickle file, deleting it, and then loading it back in from the pickle file prevents the slowdown -- if it does, we can adapt the tests to do this. And then we can open an issue to investigate this further in the long term.
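A minimal sketch of that round-trip (the file name and outdir below are illustrative; htf is whatever the existing PointMutationExecutor setup in test_repex.py produces):

import gc
import os
import pickle

outdir = "."                                   # wherever the test keeps its scratch files
htf_path = os.path.join(outdir, "htf.pickle")

with open(htf_path, "wb") as f:                # serialize the freshly generated HTF
    pickle.dump(htf, f)

del htf                                        # drop the in-memory factory built during setup
gc.collect()

with open(htf_path, "rb") as f:                # reload; this copy is what goes to HybridRepexSampler
    htf = pickle.load(f)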
Wow, this is bizarre. Can someone point me to the code that sets up the HybridTopologyFactory that exhibits this weird behavior? I wonder if it's leaving the ContextCaches in a weird state that might be forcing them to use the CPU platform some of the time.
> To address the slowdown problem, we don't actually want to run separate scripts for the test, so Iván and I discussed that the next short-term action item is to see if generating the HTF, saving it as a pickle file, deleting it, and then loading it back in from the pickle file prevents the slowdown -- if it does, we can adapt the tests to do this. And then we can open an issue to investigate this further in the long term.
This sounds like a reasonable approach to get this out the door -- we can fix this with #1084.
> Can someone point me to the code that sets up the HybridTopologyFactory that exhibits this weird behavior?
Here's the neutral test: https://github.com/choderalab/perses/blob/main/perses/tests/test_repex.py#L6-L84. You should be able to reproduce the problem with this test.
@zhang-ivy : I just tried running that test, but it doesn't seem to actually run a simulation.
EDIT: I figured it out! My OpenEye license on lilac had expired, causing an error whose stdout/stderr error message was eaten by pytest! Adding -s disables the capturing of stdout/stderr.
OK, I'm running this on an A40, and have noticed a few things:
- The GPU is definitely in use during the setup phase. This is likely using one or more ContextCaches.
- After hss.setup(), we then swap this out with two new empty ContextCaches without cleaning up the old one.
- When it runs, it hits a ton of NaNs and has to keep doing integrator restarts, which will slow things down a lot:
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
...
@ijpulidos: Can you remind me how I can get pytest to have this test generate per-iteration verbose output of timing information? I'm not having any luck with combinations like
$ pytest -s --verbose --log-level=DEBUG --log-cli-level=DEBUG perses/tests/test_repex.py::test_RESTCapableHybridTopologyFactory_repex_neutral_mutation
I managed to get the iteration info to print by running the test as __main__ and inserting
import logging
LOGLEVEL="DEBUG"
_logger = logging.getLogger()
_logger.setLevel(LOGLEVEL)
The current main version runs each iteration at ~1.4 s/iteration for the first phase and ~1.4 s/iteration for the second phase.
However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?
> The GPU is definitely in use during the setup phase. This is likely using one or more ContextCaches.
Yes, Iván and I discussed this yesterday -- we are wondering whether it's the geometry engine, which doesn't actually use a ContextCache, just a context: https://github.com/choderalab/perses/blob/main/perses/rjmc/geometry.py#L453-L463
The only other place that uses contexts in the PointMutationExecutor pipeline is when we validate endstate energies, but that part is disabled for the test, so I think it's the geometry engine.
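For what it's worth, a generic pattern (plain OpenMM, not perses code) for making sure a one-off Context created during setup does not linger is to drop its references explicitly once it has done its job; a minimal sketch:

import gc

import openmm
from openmm import unit

def evaluate_energy_then_clean_up(system, positions):
    """Compute a potential energy with a throwaway Context, then release it explicitly."""
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    energy = context.getState(getEnergy=True).getPotentialEnergy()
    # Drop the references so the platform resources (e.g. a GPU context) can be reclaimed.
    del context, integrator
    gc.collect()
    return energy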
> However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?
You can generate the htf using: https://github.com/choderalab/perses/blob/main/perses/tests/test_repex.py#L34-L44. Then save a pickled htf using:
with open(os.path.join(outdir, f"name.pickle"), "wb") as f:
    pickle.dump(htf, f)
Then run the test, skipping the htf generation and instead loading in the pickled htf using:
htf = pickle.load(open(os.path.join(outdir, f"name.pickle"), "rb"))
I don't have an example pickled htf, but maybe Iván does.
> However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?
In the attached tar.gz file you can find a bash script to run the tests. It basically just calls the python script to generate and serialize the HTFs and then a separate script to run the repex simulation. You can also find the pickled topology proposals for Openmm 7.7.
We need to double-check; this was probably solved when we merged the repex tests.
I think we already sped things up a bit using this PR: https://github.com/choderalab/perses/pull/1065. There may be room for more speedup, but I think it's fine for now. What do you think? @ijpulidos @mikemhenry