
Speed up protein mutation charge changing repex consistency test

Open zhang-ivy opened this issue 2 years ago • 28 comments

This test is currently very slow, so we need to speed it up.

See: https://github.com/choderalab/perses/issues/1044#issuecomment-1165575916

zhang-ivy avatar Jun 24 '22 17:06 zhang-ivy

@jchodera @ijpulidos : Here's the log from the last 2 iterations of running ALA->ARG (charge changing):

*******************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 4999/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2adef5fe7dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took    0.003s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 7092/93312 attempted swaps (7.6%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took    6.890s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took    3.667s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2adef5e56ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2adef5e56f70>
DEBUG:openmmtools.multistate.multistatereporter:Iteration 4999 not on the Checkpoint Interval of 10. Sampler State not written.
DEBUG:openmmtools.utils:Storing sampler states took    0.000s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.188s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 10.748s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:10.473674, at 2022-Jun-25-01:52:35 (consuming total wall clock time 14:32:48.370056).
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 5000/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2adef5fe7dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took    0.003s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 8050/93312 attempted swaps (8.6%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took    6.910s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took    3.721s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2adef5e56ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2adef5e56f70>
DEBUG:openmmtools.utils:Storing sampler states took    0.008s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.215s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 10.851s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:00, at 2022-Jun-25-01:52:35 (consuming total wall clock time 14:32:48.749468).

vs the last 2 iterations of THR->ALA (neutral):

DEBUG:openmmtools.multistate.multistatesampler:Iteration 4999/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2b7a99d82dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took    0.000s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 684/3456 attempted swaps (19.8%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took    1.919s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took    0.554s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2b7a99bf3ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2b7a99bf3f70>
DEBUG:openmmtools.multistate.multistatereporter:Iteration 4999 not on the Checkpoint Interval of 10. Sampler State not written.
DEBUG:openmmtools.utils:Storing sampler states took    0.000s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.020s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 2.495s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:02.436523, at 2022-Jun-24-03:43:01 (consuming total wall clock time 3:23:02.616504).
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:openmmtools.multistate.multistatesampler:Iteration 5000/5000
DEBUG:openmmtools.multistate.multistatesampler:********************************************************************************
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x2b7a99d82dc0>
DEBUG:openmmtools.multistate.replicaexchange:Mixing replicas...
DEBUG:openmmtools.utils:Mixing of replicas took    0.000s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 796/3456 attempted swaps (23.0%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
DEBUG:mpiplus.mpiplus:Running _get_replica_move_statistics serially.
DEBUG:openmmtools.utils:Propagating all replicas took    1.943s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took    0.556s
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x2b7a99bf3ca0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x2b7a99bf3f70>
DEBUG:openmmtools.utils:Storing sampler states took    0.005s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.053s
DEBUG:openmmtools.multistate.multistatesampler:No online analysis requested
DEBUG:openmmtools.multistate.multistatesampler:Iteration took 2.554s.
DEBUG:openmmtools.multistate.multistatesampler:Estimated completion in 0:00:00, at 2022-Jun-24-03:43:01 (consuming total wall clock time 3:23:02.735137).

zhang-ivy avatar Jun 27 '22 21:06 zhang-ivy

I just analyzed the dipeptide charge changing test data: https://github.com/choderalab/perses/issues/1044#issuecomment-1167913833

zhang-ivy avatar Jun 27 '22 21:06 zhang-ivy

@ijpulidos : This should speed the test up a lot:

move = mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picosecond,
                                 n_steps=50,
                                 reassign_velocities=False,
                                 n_restart_attempts=20)

Note that we want to wait until openmmtools 0.21.5 is released so that we can use the updated LangevinDynamicsMove with the LangevinMiddleIntegrator.

zhang-ivy avatar Jul 18 '22 20:07 zhang-ivy

By using openmm-nightly and LangevinDynamicsMove we are getting almost a 4x speedup for the neutral test:

[Plot: neutral_77-LangSplitDynMove_vs_nightly-LangDynMove]

This means that for the neutral mutation test, with our target of 3000 iterations, we are looking at ~25 minutes on an A4000 GPU (0.25 s per iteration, 3000 iterations, and 2 transformations -- forward and backward). Please note that this GPU is significantly faster than the ones in our current GPU CI workflow.

For the charge mutation (explicit solvent), the speedup is similar (considering there are more replicas/states):

[Plot: charge_77-LangSplitDynMove_vs_nightly-LangDynMove]

That would mean the tests would take ~1.5 hours for our target of 1000 iterations on an A4000 GPU (1.2 s per iteration, 1000 iterations, and 4 transformations, as the test requires).
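As a quick sanity check, these wall-clock estimates follow directly from the per-iteration timings quoted above (back-of-the-envelope arithmetic only; nothing new is measured here):

```python
# Back-of-the-envelope wall-clock estimates from the per-iteration timings
# quoted above (A4000 GPU): seconds/iteration * iterations * transformations.
neutral_seconds = 0.25 * 3000 * 2   # forward + backward neutral mutation
charge_seconds = 1.2 * 1000 * 4     # four charge-changing transformations

print(f"neutral: ~{neutral_seconds / 60:.0f} minutes")   # ~25 minutes
print(f"charge:  ~{charge_seconds / 3600:.1f} hours")    # ~1.3 hours
```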

ijpulidos avatar Aug 17 '22 01:08 ijpulidos

Great! Also, I just checked and the current tests use 3000 iterations -- why are you using 1000 iterations? Is the test still passing (i.e. `assert DDG < 6 * dDDG, f"DDG ({DDG}) is greater than 6 * dDDG ({6 * dDDG})"` doesn't raise an error), even with 1000 iterations?
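For reference, here is a minimal sketch of what that consistency assertion checks (the forward and reverse legs of the mutation should cancel to within statistical error); the variable names and placeholder values are illustrative, not the actual test code:

```python
import numpy as np

# Sketch of the repex consistency check: the forward and reverse mutation
# free energies should cancel, so their sum (DDG) should be within ~6 sigma
# of zero. Placeholder values stand in for the estimates the real test
# computes from the repex data.
forward_dg, forward_ddg = -2.1, 0.4   # kT (placeholder)
reverse_dg, reverse_ddg = 2.3, 0.5    # kT (placeholder)

DDG = abs(forward_dg + reverse_dg)
dDDG = np.sqrt(forward_ddg**2 + reverse_ddg**2)
assert DDG < 6 * dDDG, f"DDG ({DDG}) is greater than 6 * dDDG ({6 * dDDG})"
```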

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

Great! Also, I just checked and the current tests use 3000 iterations -- why are you using 1000 iterations? Is the test still passing (i.e. `assert DDG < 6 * dDDG, f"DDG ({DDG}) is greater than 6 * dDDG ({6 * dDDG})"` doesn't raise an error), even with 1000 iterations?

I was thinking of 1000 iterations based on your comment here. The test passed, but I only ran it once, even though the numbers in that same thread seem to imply it might have issues at 1000 iterations. I guess we can bump it up to 2000 to be "safe" and it would still be just over 1 hour on the CI, which should be okay.

ijpulidos avatar Aug 17 '22 02:08 ijpulidos

Ah that comment was referring to the charge changing mutations. For the neutral mutations, you'll have to see my comment here

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

But both of those comments are for experiments using 50 steps per iteration, instead of 250 steps (which is what we normally use in our production runs). Are you using 250 steps per iteration, or 50 steps?

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

Ah that comment was referring to the charge changing mutations. For the neutral mutations, you'll have to see my comment here

I see, well I guess we can use 3000 iterations and it would be 1.5-2 hours on a 1080Ti (we have to check whether our GPU CI has similar performance, but I suspect it is currently slower, since it is using an older Tesla K40).

But both of those comments are for experiments using 50 steps per iteration, instead of 250 steps (which is what we normally use in our production runs). Are you using 250 steps per iteration, or 50 steps?

50 steps, as the tests currently do:

hss = HybridRepexSampler(mcmc_moves=mcmc.LangevinDynamicsMove(timestep=4.0 * unit.femtoseconds,
                                                              collision_rate=1.0 / unit.picosecond,
                                                              n_steps=50,
                                                              reassign_velocities=False,
                                                              n_restart_attempts=20,
                                                              constraint_tolerance=1e-06),
                         replica_mixing_scheme='swap-all',
                         hybrid_factory=htf,
                         online_analysis_interval=None)

ijpulidos avatar Aug 17 '22 02:08 ijpulidos

First of all, great news about the speedup from using openmm-dev and the new LangevinMiddleIntegrator.

I'm a bit surprised at why there is an 8x difference in iteration time between the charge and neutral tests. I think there are more thermodynamic states in the charge test, but any ideas what might lead to the 8x difference in iteration time?

jchodera avatar Aug 17 '22 02:08 jchodera

I see, well I guess we can use 3000 iterations and it would be 1.5-2 hours on a 1080Ti (we have to check whether our GPU CI has similar performance, but I suspect it is currently slower, since it is using an older Tesla K40).

Sounds good!

I'm a bit surprised at why there is an 8x difference in iteration time between the charge and neutral tests. I think there are more thermodynamic states in the charge test, but any ideas what might lead to the 8x difference in iteration time?

The neutral test uses 12 replicas, while the charge test uses 36 replicas. Additionally, the neutral test only requires two transformations (A->T, T->A), whereas the charge test requires 4. I think the 8x difference makes sense given the extra thermodynamic states and transformations.

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

Oh, I see the problem. It's an 8x difference per iteration (without considering the extra transformations)...

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

Interestingly, when comparing the neutral vs charge test iteration time with the LangevinSplittingDynamicsMove and openmm 7.7, charge is in fact 3x slower than neutral.

For LangevinDynamicsMove + openmm nightly, I wonder if the GPUs being used for neutral vs charge are different? @ijpulidos Any other ideas why the charge is more than 3x slower than the neutral?

zhang-ivy avatar Aug 17 '22 02:08 zhang-ivy

@zhang-ivy Ah, good catch. I just checked and I did use a wrong node in one of the tests (I misread the ls-gpu nodes on Lilac as 1080Ti nodes, but they are only 1080s). This only affects the first plot (comparison between neutral tests). I'll re-run this test on the correct node and update the plot once I get the results.

Comparing neutral vs charge, we have to consider that the tests currently use 12 replicas for neutral and 36 for charge (which also uses explicit solvent); that may help explain the differences in iteration times.

ijpulidos avatar Aug 17 '22 02:08 ijpulidos

Comparing neutral vs charge, we have to consider that the tests currently use 12 replicas for neutral and 36 for charge (which also uses explicit solvent); that may help explain the differences in iteration times.

This would be responsible for a 3x difference, but it's hard to imagine the 1080Ti vs 1080 is responsible for the remaining ~2.7x iteration time difference.

I've made a few comments on https://github.com/choderalab/perses/pull/1065 to suggest some ways we might be able to speed things up a bit.

jchodera avatar Aug 17 '22 03:08 jchodera

This would be responsible for a 3x difference, but it's hard to imagine the 1080Ti vs 1080 is responsible for the remaining ~2.7x iteration time difference.

Yeah, that doesn't explain the remaining difference, I just updated the results.

I've made a few comments on #1065 to suggest some ways we might be able to speed things up a bit.

Will take a look at those and let you know once I get the chance to test them. Thanks for taking the time to look into this!

ijpulidos avatar Aug 17 '22 03:08 ijpulidos

I've made a few comments on https://github.com/choderalab/perses/pull/1065 to suggest some ways we might be able to speed things up a bit.

@jchodera : Thanks for the feedback! I just responded to all your comments.

Yeah, that doesn't explain the remaining difference

I just double-checked my perses barnase:barstar paper data -- I ran the A->T dipeptide (12 replicas) and R->A dipeptide (24 replicas), both in solvent, using 2 GPUs, and am seeing that A->T takes ~0.68 sec per iteration and R->A takes ~1.5 sec per iteration on lu nodes (GeForce RTX 2080). Note this is with LangevinDynamicsMove (using the middle integrator) and openmm 8.0 beta (not the nightly dev build).

@ijpulidos : Can you try re-running all the tests? And if you are still seeing an 8-fold difference, let's set up a quick chat to discuss what might be causing the difference as I am not seeing the same slowdown in my data.

zhang-ivy avatar Aug 17 '22 12:08 zhang-ivy

@ijpulidos : Can you try re-running all the tests? And if you are still seeing an 8-fold difference, let's set up a quick chat to discuss what might be causing the difference as I am not seeing the same slowdown in my data.

Please refer to my previous comment (https://github.com/choderalab/perses/issues/1058#issuecomment-1217359114) for the updated information. I settled on using my local A4000 GPU due to some reproducibility issues when using the lilac 1080Ti nodes.

I re-ran the tests and updated the plots. Basically, we were facing the problem that @zhang-ivy already warned us about in the last part of https://github.com/choderalab/perses/pull/1054#issuecomment-1178224781

We have a slowdown when we generate the HTF and run the simulation in the same script; this was causing the extra overhead. To avoid it, I had to save the HTFs to pickle files and then load them in a separate script to run the simulation. We don't know why this is happening, but we will try to debug it further in a separate issue report.

Basically, we can now verify the consistency in the updated plots, where you get a ~4x iteration time difference when comparing the two tests (that is, comparing the same colored lines across the two plots). We need to refactor the tests to account for the things discussed in this thread.

ijpulidos avatar Aug 17 '22 23:08 ijpulidos

To address the slowdown problem: we don't actually want to run separate scripts for the test, so Iván and I discussed that the next short-term action item is to see whether generating the HTF, saving it as a pickle file, deleting it, and then loading it back in from the pickle file prevents the slowdown -- if it does, we can adapt the tests to do this. We can then open an issue to investigate the slowdown further in the long term.
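A minimal sketch of what that round trip could look like inside the test; the helper names `dump_htf`/`load_htf` and the usage comments are illustrative placeholders, not existing perses code:

```python
import pickle

def dump_htf(htf, path):
    """Serialize the hybrid topology factory to `path`."""
    with open(path, "wb") as f:
        pickle.dump(htf, f)

def load_htf(path):
    """Return a freshly deserialized hybrid topology factory."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Intended use inside the test (sketch):
#     dump_htf(htf, "htf.pickle")
#     del htf                        # drop the setup-phase object
#     htf = load_htf("htf.pickle")   # reload a copy detached from setup state
#     ... continue with the repex portion of the test ...
```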

zhang-ivy avatar Aug 17 '22 23:08 zhang-ivy

Wow, this is bizarre. Can someone point me to the code that sets up the HybridTopologyFactory that exhibits this weird behavior? I wonder if it's leaving the ContextCaches in a weird state that might be forcing them to use the CPU platform some of the time.

To address the slowdown problem: we don't actually want to run separate scripts for the test, so Iván and I discussed that the next short-term action item is to see whether generating the HTF, saving it as a pickle file, deleting it, and then loading it back in from the pickle file prevents the slowdown -- if it does, we can adapt the tests to do this. We can then open an issue to investigate the slowdown further in the long term.

This sounds like a reasonable approach to get this out the door---we can fix this with #1084.

jchodera avatar Aug 18 '22 00:08 jchodera

Can someone point me to the code that sets up the HybridTopologyFactory that exhibits this weird behavior?

Here's the neutral test: https://github.com/choderalab/perses/blob/main/perses/tests/test_repex.py#L6-L84 You should be able to reproduce the problem with this test.

zhang-ivy avatar Aug 18 '22 00:08 zhang-ivy

@zhang-ivy : I just tried running that test, but it doesn't seem to actually run a simulation. EDIT: I figured it out! My OpenEye license on lilac had expired, causing an error whose stdout/stderr message was eaten by pytest! Adding `-s` disables pytest's capturing of stdout/stderr.

jchodera avatar Aug 18 '22 03:08 jchodera

OK, I'm running this on an A40, and have noticed a few things:

  1. The GPU is definitely in use during the setup phase. This is likely using one or more ContextCaches.
  2. After hss.setup(), we then swap this out with two new empty ContextCaches without cleaning up the old one (see the cleanup sketch after the log below).
  3. When it runs, it hits a ton of NaNs and has to keep doing integrator restarts, which will slow things down a lot:
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
...
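Regarding point 2 above, one cheap thing to try between setup and repex is to explicitly empty the openmmtools global ContextCache so that setup-phase Contexts release their GPU resources. A minimal sketch, assuming the stale Contexts actually live in the global cache (a bare Context created elsewhere, e.g. by the geometry engine as discussed below, would be unaffected):

```python
from openmmtools import cache

# Clear the shared ContextCache after setup so any Contexts created while
# building the hybrid system are released before the sampler creates its own
# caches. This only helps if those Contexts are held by the global cache; a
# bare Context held by another object would need its reference deleted instead.
cache.global_context_cache.empty()
```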

jchodera avatar Aug 18 '22 03:08 jchodera

@ijpulidos: Can you remind me how I can get pytest to have this test generate per-iteration verbose output of timing information? I'm not having any luck with combinations like

$ pytest -s --verbose --log-level=DEBUG --log-cli-level=DEBUG perses/tests/test_repex.py::test_RESTCapableHybridTopologyFactory_repex_neutral_mutation

jchodera avatar Aug 18 '22 03:08 jchodera

I managed to get the iteration info to print by running the test as __main__ and inserting

    import logging
    # Raise the root logger to DEBUG so the per-iteration timing messages from
    # openmmtools.multistate are emitted when the test is run as __main__.
    LOGLEVEL = "DEBUG"
    _logger = logging.getLogger()
    _logger.setLevel(LOGLEVEL)

The current main version runs at ~1.4 s/iteration for both the first and the second phase.

However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?

jchodera avatar Aug 18 '22 04:08 jchodera

The GPU is definitely in use during the setup phase. This is likely using one or more ContextCaches

Yes, Iván and I discussed this yesterday -- we are wondering whether it's the geometry engine, which doesn't actually use a ContextCache, just a Context: https://github.com/choderalab/perses/blob/main/perses/rjmc/geometry.py#L453-L463

The only other place that uses contexts in the PointMutationExecutor pipeline is when we validate endstate energies, but that part is disabled for the test, so I think it's the geometry engine.

zhang-ivy avatar Aug 18 '22 12:08 zhang-ivy

However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?

You can generate the htf using: https://github.com/choderalab/perses/blob/main/perses/tests/test_repex.py#L34-L44 Then save a pickled htf using:

with open(os.path.join(outdir, f"name.pickle"), "wb") as f:
    pickle.dump(htf, f)

Then run the test, skipping the htf generation and instead loading in the pickled htf using:

htf = pickle.load(open(os.path.join(outdir, f"name.pickle"), "rb"))

I don't have an example pickled htf, but maybe Iván does.

zhang-ivy avatar Aug 18 '22 12:08 zhang-ivy

However, I don't have something to compare it to where I serialize out the state and reload it in a different process to see it get faster by >2x. Do you have an example of that?

In the attached tar.gz file you can find a bash script to run the tests. It basically just calls the Python script that generates and serializes the HTFs and then a separate script that runs the repex simulation. You can also find the pickled topology proposals for OpenMM 7.7.

neutral_mutation_repex_test.tar.gz

ijpulidos avatar Aug 18 '22 15:08 ijpulidos

We need to double-check; this was probably solved when we merged the repex tests.

ijpulidos avatar Jul 25 '23 15:07 ijpulidos

I think we already sped things up a bit using this PR: https://github.com/choderalab/perses/pull/1065. There may be room for more speedup, but I think it's fine for now. What do you think? @ijpulidos @mikemhenry

zhang-ivy avatar Jul 25 '23 15:07 zhang-ivy