msmbuilder
msmbuilder copied to clipboard
[ENH] Calculating uncertainty in MFPT
Could probably be useful.
There are directions you can take
- bootstrap
- analytic asymptotic variances
For the bootstrap, I'm not entirely convinced about the 'right' way to do it.
- You could boostrap over the MD trajectories and for each resampling run a new clustering, etc.
- You could boostrap over the discretized trajectories.
- You could boostrap over the counts, considering each count as a separate data point.
For the analytic asymptotic variances, there's a little math to do, but it's not too bad. It depends a little bit on whether you want to use the rate matrix or the transition matrix codes. For this purpose, the two are a little bit different.
For the discrete time MSM, the asymptotic variances are approximated without considering the reversibility constraint, and the parameters are taken to be just the $n^2$ entries of T individually -- this is how Nina's paper works, and the current code in uncertainty_eigenvalues and uncertainty_timescales.
The general formula (multivariate delta theorem) for CLT-style analytic asymptotic variances in a calculated quantity $g$ (like a timescale or an MFPT) is $$ \Var(g) = \sum_{uv} \frac{\partial g}{\partial \theta_u} \Sigma_{uv} \frac{\partial g}{\partial \theta_v} $$ where $\theta$ are the free parameters of the model (the independent variables that you work with to find the MLE), and $\Sigma$ is the variance-covariance matrix of \theta, which you get from the inverse of the hessian of the objective function at the MLE.
For the both the rate matrix and the transition matrix, we have an estimator for $\Sigma$. Since $u$ and $v$ index over essentially all O(n^2) entries in the transition/rate matrix, note that $\Sigma$ is really of size n^2 x n^2. But because the Nina's paper and the MSM code ignore reversibility, the rows of the transition matrix are estimated individually (no covariance) so one of the dimensions kind of drops out (e.g. Eq 16 of Hinrichs and Pande 2007). (IMO this is one of the reasons to prefer the rate matrix formulation).
So the key step is to calculate $\frac{\partial g}{\partial \theta_u}$, which for this case is the derivative of a particular MFPT with respect to one of the model's independent variables (either an entry in the transition matrix for MSM, or the \theta vector for the rate matrix).
Lemma 11.2 and Theorem 11.16 here show the calculation of the MFPT. The next step is to chug through the chain rule to calculate the derivatives.
I went through the algebra for the rate matrix http://stanford.edu/~rmcgibbo/mfpt-uncertainty.pdf, https://gist.github.com/rmcgibbo/615543f5ef57c55e5fca
lol I didn't even get a chance to look at Chapter 11....
Your math for the rate matrix case LGTM.
As far as bootstrapping goes, I would be inclined to choose bootstrapping over discretized trajectories, since that's already how I'd be calculating error bars for the ITs.
I can help with implementing the analytical solutions in msmbuilder.
For the discrete-time case, the notation is a little unfamiliar, but I think it's derived here: http://www.sciencedirect.com/science/article/pii/S0024379511006185 (Theorem 4.3), which gives $\partial MFPT(P) / \partial P$, where $P$ is the transition matrix.
I can help with implementing the analytical solutions in msmbuilder.
Cool. I'll let you take the lead with the implementation, if you don't mind. I can give advice and help with testing.