
BHMM CK-test

Open thempel opened this issue 6 years ago • 17 comments

The CK-test for BayesianHMMs gives me a cryptic index error when the original estimate is done on an active set smaller than the full set of states. I was finally able to put together a minimal example that contains a disconnected state in the observable space. It only fails with BHMMs (the err_est kwarg does not matter).

import pyemma

# two discrete trajectories; state 0 never connects to states 1 and 2
artificial_poor_dtraj = [[0, 0, 0, 0, 0, 0],
    [1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2]]
BMSM = pyemma.msm.bayesian_hidden_markov_model(artificial_poor_dtraj, 2, 6, reversible=True)
ck = BMSM.cktest(mlags=3, err_est=False)

I tested this on the devel branch, where it prints:

pyemma.msm.estimators.bayesian_hmsm.BayesianHMSM[2] WARNING Ignored error during estimation: index 2 is out of bounds for axis 1 with size 2

Compared to version 2.4, there seem to be two differences:

a) In 2.4, the above example yields a more useful warning message: Given active set has empty states. Maybe we should restore this error message? Otherwise, users just get an identity matrix as the new estimate.

b) With my own dataset, I don't get a warning at all in 2.4 and everything works fine (not an identity matrix). Maybe the minimal example exposes another bug?
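For reference, the disconnectivity in this example can be checked by inspecting the active set of a plain MSM estimated on the same data (a quick sanity check on my part, not needed to reproduce the bug):

import pyemma

msm = pyemma.msm.estimate_markov_model(artificial_poor_dtraj, lag=1)
print(msm.active_set)     # largest connected set, here [1, 2]
print(msm.nstates_full)   # 3 states observed in total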

Any ideas?

thempel avatar Apr 12 '18 13:04 thempel

In our debug session, we figured out that the samples are already restricted to the observable set, but during the final submodel call in estimate, we try to restrict to the same set again. This causes this particular index error. It is not at all clear why this worked before.

marscher avatar Apr 17 '18 12:04 marscher

Probably related to #1324.

thempel avatar Jul 04 '18 11:07 thempel

I am running into the CK-test issue described above with a BHMM. I'm not quite sure how to interpret this discussion in terms of my model. Does this mean that my model is somehow incorrect, or that I've done something wrong? Any advice is helpful.

kochaneks avatar Jun 10 '19 20:06 kochaneks

@kochaneks thanks for reporting. I think that for larger lag times, the active set collapses (e.g. it contains unvisited states and is therefore no longer connected). This means that you cannot estimate a connected transition matrix for this lag time anymore. It would be nice if you could share your dtrajs and estimator parameters with us, so we can work on this. Tim is right about restoring the actual error message, because this index error is too cryptic.
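One quick way to see the collapse (a hypothetical diagnostic, assuming your discrete trajectories are in a list dtrajs) is to track the active set size over lag time with a plain MSM estimate:

import pyemma

# how many states remain in the largest connected set as the lag grows?
for lag in [1, 50, 150, 300]:
    msm = pyemma.msm.estimate_markov_model(dtrajs, lag=lag)
    print(lag, msm.nstates, 'of', msm.nstates_full, 'states connected')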

marscher avatar Jun 12 '19 10:06 marscher

@marscher thanks for your reply. I've attached the dtrajs below. I have chosen two states and a lag time of 150 steps.

To make sure I'm understanding the issue correctly, could you confirm or correct my interpretation: the CK-test fails at long lag times because the size of the observable set decreases rapidly. As far as I have observed, this situation does not seem to be unique to BHMMs, but maybe the shrinking of the set with longer lag times is more drastic in this case and leads to the failed CK-test.

In order to perform the CK-test for this system, I am considering estimating an analogous HMM (same number of states and lag time) and performing the CK-test with that. The CK-test will be without error bars, but may be good enough to demonstrate the validity of the model.
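A minimal sketch of that plan (assuming the dtrajs attached below and my parameters; plot_cktest is pyemma's standard CK-test plotting helper):

import pyemma

# ML HMM with the same number of states and lag time as the BHMM
hmm = pyemma.msm.estimate_hidden_markov_model(dtrajs, 2, lag=150)
ck = hmm.cktest(mlags=3)      # no Bayesian error bars here
pyemma.plots.plot_cktest(ck)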

dtrajs-virus-Jacob.txt

kochaneks avatar Jun 17 '19 17:06 kochaneks

Could you please provide the dtrajs in a non-pickled format, like compressed ASCII?

marscher avatar Jun 17 '19 17:06 marscher

Included below!

dtrajs.tar.gz

kochaneks avatar Jun 17 '19 17:06 kochaneks

Thanks for sharing. Could you provide the code snippet producing the error? I can compute a CK-test with 5 metastable states on your data with a BayesianMSM without any warning/error.

marscher avatar Jun 18 '19 11:06 marscher

It's with a BHMM, not a BMSM. I can provide the exact code if necessary.

kochaneks avatar Jun 18 '19 18:06 kochaneks

I just looked into your data. Indeed, this is the issue we are discussing here. Until it is fixed, you might want to use maximum likelihood HMMs to conduct model validation with the CK-test, and use the BHMM only if error estimates are necessary. The reason why the Bayesian model does not work in your case seems to be that, with effective counting, one of the cluster centers becomes disconnected in the effectively available data at longer lag times. Effective counting is applied for the Bayesian model estimation in order to obtain uncorrelated samples for the error estimate; it is not used for the maximum likelihood model, which is why the CK-test works there. We will provide a fix soon.
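If you want to probe whether the effective subsampling is indeed the culprit, one thing to try (an ad-hoc check, not a recommendation: with a stride of 1 the samples are correlated and the error bars will be too optimistic, and I'm assuming here that your pyemma version exposes the stride keyword on this function) is to replace the default effective stride by an explicit one:

import pyemma

# hypothetical check: override the default stride='effective'
bhmm = pyemma.msm.bayesian_hidden_markov_model(dtrajs, 2, 150, stride=1)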

thempel avatar Jun 19 '19 13:06 thempel

Thanks for your help with this! I've been able to use the maximum likelihood HMM to perform the CK-test, but unfortunately I'm a bit puzzled by the results. When I do the CK-test for an MSM at a lag time of 500 steps, it looks very good. However, when using this same lag time for the HMM, the CK-test looks very bad. Now, I know that I should be able to choose a shorter lag time for the HMM, but it would also seem that the lag time of the MSM should be valid. I have tried other lag times shorter than 500 steps, but to no avail. Do you have any suggestions? These models are all built with the same dtrajs as above, in case you want to try any of this out on your own.

kochaneks avatar Jun 27 '19 16:06 kochaneks

This sort of lag time selection is part of the optimization procedure for estimating a valid Markov model. Unfortunately, we cannot help you with that.

However, the reason why you observe this behavior might be the following: In the process of estimating an HMM, the hidden state probabilities are also learned from the data. So there is no mathematical reason why the hidden states must be the same for different lag times. In practice, they mostly are. But since HMM estimation happens at every lag time of the CK-test, in some cases you might just end up comparing transition probabilities between different sets of hidden states. In my experience, if you have reversible sampling but your rare events are very rare, this becomes less of an issue at lower lag times. So even though you are right that a high HMM lag time should be a valid choice, there are two reasons for choosing it as small as possible: 1) you want maximal time resolution of the model, and 2) HMM estimation might be algorithmically more stable (also for the multiples of your initial lag time used in the CK-test).
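A rough way to check whether the hidden states keep their identity across lag times (an ad-hoc diagnostic, assuming both estimates end up on the same observable set so the matrix shapes match) is to compare the output probabilities of two estimates:

import pyemma

hmm1 = pyemma.msm.estimate_hidden_markov_model(dtrajs, 2, lag=500)
hmm2 = pyemma.msm.estimate_hidden_markov_model(dtrajs, 2, lag=1000)
# rows are hidden states, columns observable states; a strongly
# non-diagonal overlap suggests the hidden states changed identity
overlap = hmm1.observation_probabilities.dot(hmm2.observation_probabilities.T)
print(overlap)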

thempel avatar Jun 28 '19 07:06 thempel

Thanks for your explanation. Out of curiosity, do you have any indication of when you expect a fix for the BHMM CK-test?

kochaneks avatar Jul 03 '19 00:07 kochaneks

I have a similar problem. Here's an (almost) reprex:

import pyemma
from pyemma.datasets.double_well_discrete import DoubleWell_Discrete_Data

print(pyemma.__version__)

# generate a discrete double-well trajectory of 10000 steps
dtraj = DoubleWell_Discrete_Data().generate_traj(10000)

print('Bayesian HMM')
mm = pyemma.msm.bayesian_hidden_markov_model(dtraj, nstates=2, lag=50, nsamples=500)
ck_b = mm.cktest(2, n_jobs=1)

print('ML HMM')
mm = pyemma.msm.estimate_hidden_markov_model(dtraj, nstates=2, lag=50)
ck_ml = mm.cktest(2, n_jobs=1)

Sometimes the ML model works. Most of the time, however, it doesn't, and I get the following output:

2.5.7
Bayesian HMM
05-08-20 12:06:06 pyemma.msm.estimators.bayesian_hmsm.BayesianHMSM[83] WARNING  Ignored error during estimation: index 41 is out of bounds for axis 1 with size 41
ML HMM

/Users/robertarbon/opt/miniconda3/envs/analysis/lib/python3.6/site-packages/pyemma/msm/estimators/lagged_model_validators.py:454: RuntimeWarning: invalid value encountered in true_divide
  self.P0 /= self.P0.sum(axis=0)  # column-normalize
/Users/robertarbon/opt/miniconda3/envs/analysis/lib/python3.6/site-packages/msmtools/analysis/dense/decomposition.py:395: RuntimeWarning: divide by zero encountered in true_divide
  R = R / np.sqrt(s[np.newaxis, :])
/Users/robertarbon/opt/miniconda3/envs/analysis/lib/python3.6/site-packages/msmtools/analysis/dense/decomposition.py:395: RuntimeWarning: invalid value encountered in true_divide
  R = R / np.sqrt(s[np.newaxis, :])
/Users/robertarbon/opt/miniconda3/envs/analysis/lib/python3.6/site-packages/msmtools/analysis/dense/decomposition.py:396: RuntimeWarning: invalid value encountered in true_divide
  L = L / np.sqrt(s[np.newaxis, :])

I'm guessing the two (Bayes/ML) problems are related to the hidden state identity problem @thempel just described.

I guess one way around it is to do the CK-test manually and adjust the hidden state identities by hand. I'm not sure how to do this, but I need to... so I guess I'll report back tomorrow.

RobertArbon avatar Aug 05 '20 11:08 RobertArbon

I'm not really sure if there is a good way of doing it manually. You can of course try to start the HMM optimization from the PCCA memberships estimated at your model lag time, but I'm not sure if it helps. In my eyes, keeping the hidden states identical to the initial HMM states should, at least in theory, be done in the Baum-Welch algorithm, i.e. running it while constraining the observation probabilities to the initial ones at your model lag time. I guess that's a) an unpleasant (and probably unnecessary) optimization problem, and b) I don't think it would still be a CK-test in the pyemma spirit. In my eyes it's a bit of a conceptual problem: we don't want to separate these two things, transition probability estimation and observation probability estimation, yet at the same time we want to compare transition probabilities between identical states.
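A minimal sketch of "start the HMM optimization from the PCCA memberships", assuming dtrajs and a model lag time of 500; whether seeding via msm_init actually pins the hidden state identities is exactly what is in question here:

import pyemma

# reference MSM at the model lag time; its PCCA memberships seed the HMM
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=500)
hmm = pyemma.msm.MaximumLikelihoodHMSM(nstates=2, lag=500, msm_init=msm).estimate(dtrajs)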

thempel avatar Aug 05 '20 11:08 thempel

OK, I see what you're saying regarding the conceptual problem.

RobertArbon avatar Aug 05 '20 12:08 RobertArbon

OK. I think I've found a bug but I don't think it's totally related to the issue here:

    def _compute_observables(self, model, estimator, mlag=1):
        # for lag time 0 we return an identity matrix
        if mlag == 0 or model is None:
            return np.eye(self.nsets)
        # otherwise compute or predict them by model.propagate
        pk_on_set = np.zeros((self.nsets, self.nsets))
        subset = self._full2active[model.active_set]  # find subset we are now working on (bug: see below)
        for i in range(self.nsets):
            p0 = self.P0[:, i]  # starting distribution on reference active set
            p0sub = p0[subset]  # map distribution to new active set
            p0sub /= p0sub.sum()  # renormalize
            pksub = model.propagate(p0sub, mlag)
            for j in range(self.nsets):
                pk_on_set[i, j] = np.dot(pksub, self.memberships[subset, j])  # map onto set
        return pk_on_set

The call subset = self._full2active[model.active_set] should be subset = self._full2active[estimator.active_set], to be consistent with _compute_observables_conf.

This is on the develop branch.

Making a PR now.

RobertArbon avatar Aug 06 '20 10:08 RobertArbon