msmbuilder-legacy
[WIP] MCMC sampling of rates and Tprobs
This is to give people a feel for the interface. At this point, mcmc.TransitionSampler and mcmc.RateSampler are working.
Here is the driver code:
import numpy as np
from msmbuilder import mcmc

num_states = 3
alpha = 1
C = np.ones((num_states, num_states))
C[0, 0] = 1000

# Posterior sampling of transition matrices.
sampler = mcmc.TransitionSampler(C)
sampler.sample(50000, thin=20)
Ti = []
for T in sampler.iterate_T():
    Ti.append(T)
Ti = np.array(Ti)
T = Ti.mean(0)

# Posterior sampling of rate matrices.
sampler = mcmc.RateSampler(C)
sampler.sample(50000, thin=20)
Ki = []
for K in sampler.iterate_K():
    Ki.append(K)
Ki = np.array(Ki)
K = Ki.mean(0)
The reason I use the iterate_K() function is that we have to handle two cases (see the sketch after this list):
- The samples all fit in memory.
- The samples are too big to fit in memory.
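To illustrate the idea, here is a minimal sketch of the generator pattern involved (not the PR's actual implementation; iterate_samples is a hypothetical name):

import numpy as np

def iterate_samples(trace):
    """Yield one sample at a time so callers never hold the full trace in RAM."""
    for i in range(len(trace)):
        yield trace[i]

# Running mean without materializing the whole trace:
trace = np.random.rand(100, 3, 3)  # stand-in for an on-disk pymc trace
mean = np.zeros((3, 3))
for n, T in enumerate(iterate_samples(trace), start=1):
    mean += (T - mean) / n  # incremental (Welford-style) mean update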
I'll add more comments and docstrings later. It might be useful for people to just ask questions about what isn't intuitive about the PyMC interface.
Looks cool. What is the 'thin' kwarg supposed to do?
Will this only work for dense matrices? Do you think this is going to be possible for a 10,000 state model?
thin controls how much the MCMC chain is subsampled to reduce autocorrelation: with thin=20, only every 20th sample is stored, so sample(50000, thin=20) keeps 50000 / 20 = 2500 samples.
The current implementation will only work for dense models. I have ideas about how to scale things up better, but it's definitely going to be a challenge. I think there are open questions about what priors to use, etc.
For example, transition matrices are never sparse, so we shouldn't really be assuming sparsity during estimation. This leads to huge problems for 10,000-state models: a dense double-precision matrix of that size is already 10,000 × 10,000 × 8 bytes ≈ 800 MB, and every MCMC sample is another matrix that size.
I think things might be better for rate matrices, but then we have the issue of matrix exponentiation.
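For context, turning a sampled rate matrix into transition probabilities requires a matrix exponential, which is the expensive step (a small sketch; the 2-state K and lag time are illustrative):

import numpy as np
import scipy.linalg

# Illustrative 2-state rate matrix: nonnegative off-diagonals, rows sum to zero.
K = np.array([[-0.1,  0.1],
              [ 0.3, -0.3]])
tau = 1.0  # lag time

T = scipy.linalg.expm(K * tau)  # dense O(n^3); this is what hurts for large n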
Also: the current code uses an Uninformative prior, not a Dirichlet prior. I'm still figuring out the best way to do a Dirichlet prior. The issue is that I need to "stack" a bunch of Dirichlet random variables, but PyMC doesn't have a very elegant way to do this.
I think I might end up just using an Uninformative prior and throwing the prior on as an external "potential".
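For concreteness, here is a minimal sketch of the "external potential" idea in pymc 2; the row variable T_rows is a stand-in, not the PR's actual parameterization:

import numpy as np
import pymc

alpha = 1.0  # with alpha = 1 this potential reduces to the uniform (zero) case
num_states = 3

# Stand-in for an unconstrained row-stochastic variable.
T_rows = pymc.Uniform("T_rows", lower=0.0, upper=1.0,
                      size=(num_states, num_states))

@pymc.potential
def dirichlet_prior(T=T_rows):
    # Sum of row-wise Dirichlet(alpha) log-densities, up to normalizing constants.
    return (alpha - 1.0) * np.sum(np.log(T))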
Speed is definitely an issue here--these chains are super slow to converge...
How slow do you mean?
Waiting a day probably isn't a big deal. Can you make a parallel version?
It seems like we need ~1-2 minutes to converge for small matrices. I think the code is still worth having, as it's simple and easy to understand. Some error analysis is better than none...
Agreed.
So for estimating rates, it might be natural to use an exponential distribution as a prior. This should place maximum prior weight at $K_{ij} = 0$, inducing sparsity in the rate matrix.
We would then have a single hyperparameter, $\lambda$, that would be shared for all the rates. We could estimate $\lambda$ by cross validation.
I can't remember if I've ever seen this in the literature, but it seems like the right thing to do.
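A minimal sketch of that prior as a pymc potential (the variable K_offdiag and its bounds are assumptions for illustration):

import numpy as np
import pymc

lam = 10.0  # shared hyperparameter lambda, to be set by cross validation
num_states = 3

# Stand-in for the sampler's nonnegative off-diagonal rates.
K_offdiag = pymc.Uniform("K_offdiag", lower=0.0, upper=100.0,
                         size=(num_states, num_states))

@pymc.potential
def exponential_prior(K=K_offdiag):
    # log prod_ij lam * exp(-lam * K_ij) = const - lam * sum(K);
    # the constant doesn't affect sampling, so we drop it.
    return -lam * np.sum(K)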
I just checked in a revised codebase that appears to be working. Here is a driver program that exercises all four model types:
import numpy as np
import scipy.sparse
from msmbuilder import mcmc, MSMLib

num_states = 2
alpha = 1
C = 100 * np.ones((num_states, num_states))
C[0, 0] *= 1000
C[0, 1] += 1

# Reference point estimates: the naive MLE and the reversible MLE.
T0 = MSMLib.estimate_transition_matrix(C)
X0 = MSMLib.mle_reversible_count_matrix(scipy.sparse.csr_matrix(C)).toarray()
T0_rev = MSMLib.estimate_transition_matrix(X0)

sampler = mcmc.TransitionSampler(C)
sampler.sample(1000000, thin=500)
T = sampler.mcmc.trace("T")[:].mean(0)
T_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.RateSampler(C)
sampler.sample(1000000, thin=500)
T_K = sampler.mcmc.trace("T")[:].mean(0)
T_K_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.ReversibleTransitionSampler(C)
sampler.sample(1000000, thin=500)
T_rev = sampler.mcmc.trace("T")[:].mean(0)
T_rev_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.ReversibleRateSampler(C)
sampler.sample(1000000, thin=500)
T_K_rev = sampler.mcmc.trace("T")[:].mean(0)
T_K_rev_sigma = sampler.mcmc.trace("T")[:].std(0)
So I'm finding some differences between the different models. The differences are most obvious in the few-count regime, as we would expect. Thus, we have an interesting question: what is the right answer? I suspect there might not be a "right" answer.
Hm, there might still be something funny with the reversible rate sampler. I'm looking into it.
I think I fixed the issue with the ReversibleRateSampler. I needed to discard the early samples (burn-in). I added this feature and now I think things are working much better.
I also added tests, which are disabled by default due to slowness.
Here is an updated driver program:
import numpy as np
import scipy.sparse
from msmbuilder import mcmc, MSMLib

num_states = 2
alpha = 1
num_steps = 1000000
thin = 500
burn = 50000

C = 1000000 * np.ones((num_states, num_states))
C[0, 0] *= 1000
C[0, 1] += 1

# Reference point estimates: the naive MLE and the reversible MLE.
T0 = MSMLib.estimate_transition_matrix(C)
X0 = MSMLib.mle_reversible_count_matrix(scipy.sparse.csr_matrix(C)).toarray()
T0_rev = MSMLib.estimate_transition_matrix(X0)

sampler = mcmc.TransitionSampler(C)
sampler.sample(num_steps, thin=thin, burn=burn)
T = sampler.mcmc.trace("T")[:].mean(0)
T_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.RateSampler(C)
sampler.sample(num_steps, thin=thin, burn=burn)
T_K = sampler.mcmc.trace("T")[:].mean(0)
T_K_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.ReversibleTransitionSampler(C)
sampler.sample(num_steps, thin=thin, burn=burn)
T_rev = sampler.mcmc.trace("T")[:].mean(0)
T_rev_sigma = sampler.mcmc.trace("T")[:].std(0)

sampler = mcmc.ReversibleRateSampler(C)
sampler.sample(num_steps, thin=thin, burn=burn)
T_K_rev = sampler.mcmc.trace("T")[:].mean(0)
T_K_rev_sigma = sampler.mcmc.trace("T")[:].std(0)

# Compare posterior means against the reference MLEs:
print(T0 - T)
print(T0 - T_K)
print(T0_rev - T_rev)
print(T0_rev - T_K_rev)
I think this is almost ready, so people can comment on what else they want to see.
Here are some things that I didn't implement:
- Estimation with a fixed value of the equilibrium populations.
- Sparse estimation.
- Estimation with a fixed sparsity structure.
These are considerably more difficult, but doable. We can implement them as people find they need them.
So I have been experimenting with NumExpr for speeding up likelihood calculations (this is for my ensemble stuff, but it would also apply here and to MLE code in MSMB). For expressions involving vectorized logarithms and algebra, it's trivial to get a ~3X speedup by just throwing in something like
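(The actual snippet didn't survive in the thread; for illustration, here is the kind of substitution meant, applied to a multinomial log-likelihood. The array names are hypothetical.)

import numpy as np
import numexpr as ne

C = np.random.rand(2000, 2000)          # counts (illustrative)
T = np.random.rand(2000, 2000) + 1e-10  # probabilities (illustrative)

logl_np = np.sum(C * np.log(T))           # plain NumPy: several temporaries
logl_ne = ne.evaluate("sum(C * log(T))")  # NumExpr: one multithreaded pass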
Sweet. numexpr is already a dependency, via tables, so this is all upside.
@kyleabeauchamp, is this identical to what's proposed in J. Chem. Phys. 128, 244103 (2008)? If not, what are the differences?
Similar, but not identical. I think pymc uses normally distributed proposal steps, with a proposal scale that is tuned to optimize the acceptance rate.
In Noe's paper, he analytically calculates a bunch of proposal distributions.
For a model that is on the boundary of the feasible space, I bet Frank's stuff works better. For models in the interior, I bet ours might be better. I'm not sure--I'm not an expert.
I think the real value in my method is simplicity. All I had to do was find a parameterization with the correct constraints (detailed balance, normalization, rate normalization, etc.); pymc does the rest.
We could possibly get performance enhancements by using a Gibbs-sampling-type approach, where we work out the proposal distributions by hand. That would be more work...
But it's sampling the same distributions. The difference, you're saying, is just efficiency?
Well, there's also the question of prior.
For a general transition matrix, the Dirichlet prior is pretty standard. I don't think it's as clear-cut how to parameterize the prior for constrained models or for rate models.
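For reference, this is just the standard conjugate update: with independent Dirichlet($\alpha$) priors on the rows and a multinomial likelihood for the counts, row $i$'s posterior is Dirichlet($C_{i\cdot} + \alpha$), giving the posterior mean

$\mathbb{E}[T_{ij} \mid C] = \frac{C_{ij} + \alpha_j}{\sum_k (C_{ik} + \alpha_k)}$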
Fair.
I think there's a lot that could be done to generalize and accelerate this code, but I think it might be wise to wait to see how we plan to build models in the future. In particular, large sparse models are going to be a pain in the ass. If we can avoid them using some sort of TICA trick, then we can really use MCMC throughout the pipeline.
Yeah, I'm okay with slow code really. I think we want flexibility and generality first. We can deal with speed later.
Regarding speed, there are several things that might be rate limiting, depending on what regime we work in:
- Large numbers of pymc Potential function calls (fix with Cython?)
- Large numbers of exp() calls with small inputs (fix with Cython)
- exp() calls with large inputs (fix with NumExpr)
- Dense or sparse matrix math
OK, so it appears that I will likely need to rewrite this code to better handle NaNs arising from zero counts.
Argh. So I think there's a bug in pymc that won't let you initialize a Dirichlet RV at a boundary point. With alpha = 1 (i.e. a uniform prior), such a point is perfectly valid, but the upshot is that we cannot use the unregularized MLE as the starting point for MCMC.
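One plausible workaround (my suggestion, not something in the PR) is to nudge the MLE slightly into the interior of the simplex before handing it to pymc:

import numpy as np

def interior_point(T_mle, eps=1e-6):
    """Mix each row of the MLE with the uniform distribution so no entry
    sits exactly on the simplex boundary."""
    n = T_mle.shape[1]
    T0 = (1.0 - eps) * T_mle + eps / n
    return T0 / T0.sum(axis=1, keepdims=True)  # re-normalize rows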