msmbuilder-legacy
msmbuilder-legacy copied to clipboard
MSMExplorer TPT issues
Hi, I recently noticed 2 potential bugs in the TPT function of MSMExplorer:
- Comparing the top pathways generated by MSMExplorer to those found with tpt.py, I see a divergence after the first pathway.
- When I load my transition matrix, I'm also asked if I also want to import an equilibrium populations file. When I run TPT with and without using such a file, using the same transition matrix, I get DIFFERENT TOP PATHWAYS. This definitely seems like a bug, since the pathway flux should depend only on the transition matrix.
Here's the first two top pathways I found with tpt.py: 19:43:36 - 1 | [28, 10, 2] | (28, 10) | 1.49127195464e-10 19:43:36 - 2 | [28, 16, 10, 2] | (28, 16) | 1.49121414914e-10
MSMExplorer with equilibrium populations file, top 2 pathways:
MSMExplorer without equilibrium populations file, top 2 pathways:
Some help on this would be great. Thanks! Jade
@jadeshi thanks. I should note here, as the author of the MSMBuilder TPT code, that it could definitely have bugs.
We should search for a reference implementation we trust or devise a few test cases with known answers.
PS I started a discussion of analytical test cases here:#243
It has been some time since I reviewed TPT, but I believe the implementation of TPT in MSMExplorer is based on this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2832055/
The statement of the reactive flux matrix (see, for example, equation 9) stated therein does use equilibrium probabilities, so on that basis it's not clear to me that the issue the OP identifies is in fact obviously incorrect behavior. Can someone clarify why this is wrong, if it is wrong?
On the issue of divergence with msmbuilder results, my recollection was that while the reactive flux matrix is precisely defined, the process for extracting top paths was in fact unspecified by TPT in the sense that an implementor could chose whichever path search algorithm they desired. This will certainly result in different rankings of paths and potentially different paths. Msmbuilder used to use a greedy recursive backtracking algorithm; MSMExplorer currently uses an implementation of Dijkstra, which it switched to when msmbuilder switched to Dijkstra.
Agreed. AFAIK, the pathway decomposition is not a physical observable and is therefore not uniquely determined.
With respect to the second issue I raised, what I find strange is if I use just a transition matrix and specify no equilibrium populations file, I get one set of pathways, whereas if I specify both a transition matrix and equilibrium probabilities file (calculated with msmbuilder's msm_analysis.get_eigenvectors), I get a different set of pathways. I'm a bit confused as to why this could be.
Yeah, that part is confusing.
Hmm, Jade and I were trying to unsuccessfully debug this problem this after noon with a custom 5*5 matrix but we were not able to pin point where the bug is.
@brycecr Why do you think different path searching algorithms would give different top paths on the same transition matrix? I can only assume that MSM Builder and MSM Explorer are using different definitions for the "distance" between the two states for that to be the case.
Given equation #9 from the paper references, the equilibrium probabilities are used to calculate the reactive flux matrix. Thus, on line 300 (or so, depending exactly which commit you have) of TPTFactoryCM.java, we have:
fFluxes.setEntry(i, j, eqProbs[i]*bCommittors[i]*m_tProb.getEntry(i,j)*fCommittors[j]);
That is, the equilibrium probabilities are directly used in the calculation. If you do not specify an equilibrium probabilities file to MSMExplorer, it assumes all states in the model have the same equilibrium probability. Thus, different values are used in flux calculation.
If this is unexpected or strange, as it seems to be, then I agree it is a bug. But I'm not clear why this is strange at this point so I'm not sure how to approach a solution. In particular, are you suggesting that there is a way to calculate a reactive flux matrix that does not depend on equilibrium probabilities, such that the same results should be obtained whether the file is defined or not, or should MSMExplorer warn or refuse to run TPT without a specified equilibrium probabilities file?
@msultan If you are interested in the top path or top two paths, for instance, how do you know which path has the top reactive flux? We had a very simple algorithm in msmbuilder at one point which simply started at the target and traveled edges backwards, simply taking the highest flux path at every step. If you apply a similar algorithm forward on the graph below, you will get different ordering of top paths then if you apply it backwards:
That's the basic principle of why you might see different top paths from different algorithms. In MSMExplorer, we currently use Dijkstra's algorithm with the modification that once we find a path, we subtract the reactive flux along the path from the matrix. I don't claim this returns definitely returns the best or most meaningful (or even bug-free) paths, but it was at one point mostly consistent with msmbuilder in testing.
Thanks for looking into this, folks!
Three quick comments:
(1) I there is an optimal solution to the path-finding problem, given a net flux matrix. I believe this contradicts what @kyleabeauchamp was saying above. So MSMBuilder and MSMExplorer should be giving the same solutions -- it's not a matter of convention. We will also want to check the net flux calculations are consistent.
(2) @brycecr : note that the eq pops are the first (left) eigenvector of the transition matrix -- so if not provided, the eq populations can be computed directly from the transition matrix without too much expense. This is definitely desirable -- the assumption all states have equal eq. prob is likely pretty bad.
(3) I have volunteered to help look into this, but if people discover stuff, continue to post up here!
PS. With regards to (1): if I remember right, the optimal algorithm should be to "cut" the top path's bottleneck (node-to-node connection along the path carrying the least current) before finding the next path. Is this what you are doing @brycecr ?
@tjlane
Yes, I'm cutting the bottleneck and subtracting the bottleneck flux out of all edges along the path. There's a note in the code that claims this should be optimal, so you're probably right on optimality. I'm not sure that I'm using the correct cost heuristic for dijkstra to be optimal for what we want, however; it should be max bottleneck flux, I think, but right now it's a sum of path edge fluxes of some sort.
Good point on calculating the equilibrium probabilities. Should be easy to do generally in MSMExplorer, although replacing all places where that file is requested will take a bit more & so might take me a little while to get to.
Can someone, perhaps @jadeshi, give me a tprob & eqProb file for which msmbuilder and msmexplorer diverge so I can play with the two and try to get consistent results?
Also, I assume this is on msmbuilder2?
@brycecr I think we are computing slightly different things. I was just "cutting" the bottleneck -- leaving all the other edges along the path unaffected.
The correct method should probably preserve the following desirable property: for a network with N paths, the sum off all paths should give the total net flux from the sources to the sinks.
The bottleneck cutting method retains this property, I believe, while the path-flux subtraction approach does not: for N pathways, each carries a current proportional to the inverse of it's resistance (aka bottleneck value) -- so the total flux from sources to sinks will be the sum of all the bottleneck fluxes.
Consider the two algorithms on the following simple example (which should become a unit test) to convince yourself of this:
Nodes: A, B, C, D. A = source, D = sink.
Edges / Flux
A -> B : 0.5
A -> C : 1.0
B -> D : 2.0
C -> B : 0.9
C -> D : 0.8
If I transcribed that properly, it should highlight the differences between the two methods.
@tjlane , when you say the sum of all paths, does that preclude a constant matrix that contains no paths? I see your point, I think, on the example you provide, but that graph has the, to me, odd property that flux entering and leaving nodes is unequal -- more leaves then enters C, for example. Is that observed in models?
I believe the reason I followed the path subtraction approach in MSMExplorer is the suggestion in this paper: http://biocomputing.mi.fu-berlin.de/publications/MeScVa07.pdf in the discussion of residuum current at equation 2.45. Perhaps I am misunderstanding the terms here, but i read the suggestion from the paragraph following equation 2.45 to suggest that subtracting the bottleneck flux from each edge gives a matrix from which the next path can be obtained. Again, I didn't read this in great detail and may have misunderstood, but that's where MSMExplorer's behavior comes from
@brycecr good point -- yes typically the flux into and out of each node should be equal. So scrap my example!
I need to re-read this paper, it's been a few years (!). But at a glance it looks like they are suggesting your algorithm. Case not closed, but I (now) suspect yours is correct and the one I coded into MSMb is currently not.
I'm finding that CalculateTPT.py and tpt.py yield different net_flux matrices. tpt.py's seems to be correct though.
@cxhernandez what commands exactly are you running?
CalculateTPT.py
looks like it's calling tpt.py
, so I'm a bit confused!
Also posting this here, since it seems like a lot of people are working on/using TPT:
What constitutes a "path" in TPT is not precisely defined. It depends on what one wants to define as a "path". Currently, MSMBuilder and MSMExplorer do two different things.
I am thinking of switching MSMBuilder's path-finding algorithm over to MSMExplorer's since that algo has a few nice properties the current one doesn't, namely the sum of the flux of all paths found will equal the total flux from the sources to the sinks, and second the paths returned will be more diverse.
If you have done any work on coding a new path finding algorithm, please post here so I know. Otherwise I'll implement the MSMExplorer version in MSMBuilder, and people can use whichever they want.
Also, as a generic note, the paths from EITHER algorithm do not necessarily give you "representative" pathways that you might observe e.g. during dynamics. For example, consider an off-pathway intermediate in folding
I_off
\
U -- I_on -- F
The state I_off
will never be included in ANY path, even though it is clearly important for understanding the dynamics of this system. I'd recommend people actually read the TPT papers before using the algorithms.
False alarm! I mixed up calculate_net_fluxes with calculate_fluxes...
I agree with integrating Metzner into MSMBuilder. I think from a physical standpoint the concept of residuum current makes a lot of sense, and in this respect the current MSMBuilder pathfinding algorithm could potentially overestimate fluxes when calculating top pathways. Also, in practice, I found that Metzner gave a more comprehensive set of pathways.
@jadeshi good to know @cxhernandez owes me coffee
@cxhernandez not really : P
I think making the switch to the MSMX version makes sense.
Sent from my Phone. Sorry for the brevity or unusual tone.
On Oct 31, 2013, at 2:19 PM, TJ Lane [email protected] wrote:
Also posting this here, since it seems like a lot of people are working on/using TPT:
What constitutes a "path" in TPT is not precisely defined. It depends on what one wants to define as a "path". Currently, MSMBuilder and MSMExplorer do two different things.
I am thinking of switching MSMBuilder's path-finding algorithm over to MSMExplorer's since that algo has a few nice properties the current one doesn't, namely the sum of the flux of all paths found will equal the total flux from the sources to the sinks, and second the paths returned will be more diverse.
If you have done any work on coding a new path finding algorithm, please post here so I know. Otherwise I'll implement the MSMExplorer version in MSMBuilder, and people can use whichever they want.
Also, as a generic note, the paths from EITHER algorithm do not necessarily give you "representative" pathways that you might observe e.g. during dynamics. For example, consider an off-pathway intermediate in folding
I_off \
U -- I_on -- F The state I_off will never be included in ANY path, even though it is clearly important for understanding the dynamics of this system. I'd recommend people actually read the TPT papers before using the algorithms.
— Reply to this email directly or view it on GitHub.