PyEMMA

Tips on combining features across systems

Open cgseitz opened this issue 2 years ago • 6 comments

Hello,

I have two systems, A and B. The only differences between the two are a few point mutations. After much analysis of the MD trajectories, I can say that these mutations do not appreciably change either the structure or the dynamics. Both systems access two states, let's call them open and closed. I want to compare the transition kinetics between these two states across my two systems. From the implied timescales (ITS) I can see what lag times I need for my systems. At this lag time, I have gotten stuck. Here are the MSMs that I have been able to create:

  1. MSMs that only capture the open state, or only capture the closed state, but not both
  2. MSMs that capture the open and closed state for system A, but not for system B

If some features from 1. capture the open state at lag time x and other features capture the closed state at lag time x, why can't I combine those features to capture both states and see the transition between the open and closed states? If I've found features that capture the open and closed states for system A, why wouldn't the same features work for system B? If I wanted to compare the transition kinetics between these systems, do I need to use the same features for both systems, or is this irrelevant?

I can provide any more information necessary. Thanks for your help!

Best, Christian

cgseitz avatar Dec 08 '21 19:12 cgseitz

That sounds like your states are not connected, i.e., that you are not sampling both opening and closing transitions. That would be a possible reason for point 1: MSMs give you the largest connected set, and that may be either open or closed if the two are disconnected. To me it sounds like you can describe the opening/closing with the features that you've picked, but the MSM estimation does not model the process in the full state space because your data do not connect the open and closed states. PyEMMA will then just yield the largest connected set of states, which may describe only one of the two. So maybe the same features work for both systems in terms of good observables, but what they are describing is an off-equilibrium sample that doesn't describe the process fully. Maybe you can check before the clustering whether your observables show transitions in both directions, open -> closed and closed -> open, in both datasets.
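A minimal sketch of that check, assuming a single 1D observable (for example a distance) and two hand-picked cutoffs. The function name `count_transitions`, the thresholds, and the toy numbers are hypothetical and not part of PyEMMA:

```python
import numpy as np

def count_transitions(observable, closed_below, open_above):
    """Count closed->open and open->closed events in a 1D observable.

    Frames between the two thresholds keep the last assigned state, so
    brief recrossings of a single cutoff are not counted as transitions.
    """
    last = None  # last clearly assigned state: "open", "closed", or None
    counts = {"closed->open": 0, "open->closed": 0}
    for value in np.asarray(observable, dtype=float):
        if value >= open_above:
            state = "open"
        elif value <= closed_below:
            state = "closed"
        else:
            continue  # ambiguous frame, keep the previous state
        if last is not None and state != last:
            counts[f"{last}->{state}"] += 1
        last = state
    return counts

# Toy trace (hypothetical numbers, arbitrary units)
trace = [1.0, 1.1, 2.0, 3.2, 3.1, 2.4, 0.9, 1.0, 3.5]
result = count_transitions(trace, closed_below=1.2, open_above=3.0)
print(result)
```

If one of the two counts is zero for a dataset, the open and closed regions are probably not reversibly connected in that dataset, which would be consistent with an MSM that only captures one of them.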

> If I wanted to compare the transition kinetics between these systems, do I need to use the same features for both systems, or is this irrelevant?

If you want to compare the kinetics quantitatively, i.e. identify metastable states across datasets, you need to make sure that your MSMs live in the same state space. That means that all the estimation steps up to and including the clustering need to be done in the same way for both datasets. This makes sure that all your micro-states (that you get from the clustering) are the same for both MSMs. (Make sure you're taking care of the active sets of the resulting MSMs, they may be different for each data set when you bisect the discrete trajectories after the clustering.) I did something similar for the local models described in this paper but with HMMs.
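One way to picture "the same state space" is to fit the cluster centers once on the pooled features of both systems and then assign each system to those shared centers. The sketch below uses a plain NumPy k-means with a fixed seed purely for illustration; `fit_centers`, `assign`, and the synthetic features are assumptions, and in practice PyEMMA's own featurization and clustering would take their place:

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest center for every frame."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_centers(data, k, n_iter=20, seed=0):
    """Plain Lloyd k-means; the fixed seed makes the result reproducible."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(data, centers)
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:  # leave empty clusters where they are
                centers[i] = members.mean(axis=0)
    return centers

# Synthetic stand-ins for the (already dimension-reduced) features
rng = np.random.default_rng(1)
feats_A = rng.normal(loc=0.0, size=(200, 2))
feats_B = rng.normal(loc=1.0, size=(200, 2))

# Fit once on the pooled data, then discretize each system separately
pooled = np.vstack([feats_A, feats_B])
centers = fit_centers(pooled, k=5)
dtraj_A = assign(feats_A, centers)
dtraj_B = assign(feats_B, centers)
```

Because both discrete trajectories index into the same `centers`, microstate `i` means the same region of feature space in both MSMs, even if one system never visits it.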

thempel avatar Dec 09 '21 12:12 thempel

Ah thanks for this explanation and the paper reference, it also contains clear explanations. The only part I don't understand is when you say, "Make sure you're taking care of the active sets of the resulting MSMs, they may be different for each data set when you bisect the discrete trajectories after clustering." Are you talking about adjusting the lag time, which may need to be different based on the features I choose, even if the motion is the same? Or something else?

cgseitz avatar Dec 10 '21 18:12 cgseitz

Clustering (at least with k-means) is non-deterministic, so I suggest you keep it fixed between the MSM estimations. Even with fixed clustering and the same featurization / dimension reduction model, each MSM is still supported by a different set of data, which may not result in the same active set (as Tim pointed out). This means that the largest connected set based on the estimated count matrix is potentially a different one for each MSM [recall that MSMs in PyEMMA are only defined on a connected set of the connectivity graph implied by the count matrix]. ~One way of dealing with this (Tim's suggestion) is restricting the discrete states to the intersection of all active sets.~

clonker avatar Dec 13 '21 09:12 clonker

Let me give an example: Two systems A and B that have partially overlapping discrete states with trajectories dt_A = [0, 1, 2, 1, 0] and dt_B = [1, 2, 3, 2, 1]. That means your full set of states is [0, 1, 2, 3], but for system A, [0, 1, 2] is the largest connected set because this system never visits state 3. For B, the active set would be [1, 2, 3]. You can now estimate two MSMs independently, but if you want to compare them, you have to make sure that you map back to the full set description again. In this example, PyEMMA would re-index the active set of system B as [1, 2, 3] -> [0, 1, 2], which are the row and column indices of the estimated transition matrix. (In more complex examples, the active set may be something like [0, 3, 22, ...]; you don't want transition matrices with empty rows/columns, so re-indexing is applied.)
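The example can be reproduced with a few lines of NumPy. This is only a sketch: it uses a simple row-normalized count matrix at lag 1 instead of PyEMMA's estimators, and the helper names (`count_matrix`, `largest_connected_set`, `embed`) are made up for illustration:

```python
import numpy as np

def count_matrix(dtraj, n_states, lag=1):
    """Transition counts between discrete states at the given lag."""
    C = np.zeros((n_states, n_states), dtype=int)
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1
    return C

def largest_connected_set(C):
    """Largest strongly connected set of the graph implied by C."""
    n = len(C)
    reach = (C > 0) | np.eye(n, dtype=bool)
    for k in range(n):  # transitive closure: i reaches j via k
        reach |= reach[:, [k]] & reach[[k], :]
    mutual = reach & reach.T  # i and j reach each other
    best = max(range(n), key=lambda i: mutual[i].sum())
    return np.flatnonzero(mutual[best])

def embed(T_active, active_set, n_states):
    """Map an active-set transition matrix back to full-state indices."""
    T_full = np.zeros((n_states, n_states))
    T_full[np.ix_(active_set, active_set)] = T_active
    return T_full

n_states = 4
dt_A = [0, 1, 2, 1, 0]
dt_B = [1, 2, 3, 2, 1]

results = {}
for name, dtraj in (("A", dt_A), ("B", dt_B)):
    C = count_matrix(dtraj, n_states)
    active = largest_connected_set(C)
    C_active = C[np.ix_(active, active)]
    T_active = C_active / C_active.sum(axis=1, keepdims=True)
    results[name] = (active, T_active, embed(T_active, active, n_states))

active_A, T_A, full_A = results["A"]
active_B, T_B, full_B = results["B"]
print(active_A, active_B)
print(np.allclose(T_A, T_B), np.allclose(full_A, full_B))
```

In this toy case the two 3x3 active-set matrices are numerically identical even though their rows refer to different states, so comparing them directly would be misleading. Only after mapping back to the full four-state indexing does the difference between the systems show up.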

Edit: I wouldn't restrict to the intersection of all active states, sorry if I phrased it like that.

thempel avatar Dec 13 '21 16:12 thempel

This paper helps me a lot. Is the corresponding code available?

Seral17 avatar Jan 13 '22 08:01 Seral17

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 01:07 stale[bot]