msmbuilder-legacy
CalculateImpliedTimescales.py crashes using assignments given by AssignHierarchical.py
Following on from the issue I raised earlier (#295), I'm having trouble with the CalculateImpliedTimescales.py script when working with assignments generated by AssignHierarchical.py. This does not seem to occur when using assignments generated by RMSD hybrid clustering.
The following is the typical output I'm getting when executing the script:
CalculateImpliedTimescales.py -a Data1/Assignments.h5 -l 1,100 -i 5 -o Data1/ImpliedTimescales.dat
--------------------------------------------------------------------------------
MSMBuilder version 2.7.dev.dev-Unknown
See file AUTHORS for a list of MSMBuilder contributors.
--------------------------------------------------------------------------------
Copyright 2011 Stanford University.
MSMBuilder comes with ABSOLUTELY NO WARRANTY.
MSMBuilder is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
--------------------------------------------------------------------------------
Please cite the following references:
GR Bowman, X Huang, and VS Pande. Methods 2009. Using generalized ensemble
simulations and Markov state models to identify conformational states.
KA Beauchamp, GR Bowman, TJ Lane, L Maibaum, IS Haque, VS Pande. JCTC 2011.
MSMBuilder2: Modeling Conformational Dynamics
at the Picosecond to Millisecond Timescale
IS Haque, KA Beauchamp, VS Pande. In preparation.
A Fast 3 x N Matrix Multiply Routine for Calculation of Protein RMSD.
--------------------------------------------------------------------------------
{'assignments': 'Data1/Assignments.h5',
'eigvals': 10,
'interval': 5,
'lagtime': '1,100',
'notrim': False,
'output': 'Data1/ImpliedTimescales.dat',
'procs': 1,
'quiet': False,
'symmetrize': 'MLE'}
21:37:54 - Getting 10 eigenvalues (timescales) for each lagtime...
21:37:54 - Building MSMs at the following lag times: [1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96]
21:37:54 - Calculating implied timescales at lagtime 1
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
...
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
Traceback (most recent call last):
File "/usr/local/bin/CalculateImpliedTimescales.py", line 5, in <module>
pkg_resources.run_script('msmbuilder==2.7.dev', 'CalculateImpliedTimescales.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 82, in <module>
(not args.notrim), args.symmetrize, args.procs)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 64, in run
trimming=trimming, symmetrize=symmetrize, n_procs=nProc)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/msm_analysis.py", line 185, in get_implied_timescales
...
lags = result.get(999999)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
...
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
It simply cuts out after the last line without writing out ImpliedTimescales.dat.
So I'm not sure which part of the calculation is crashing, but this does happen sometimes.
I think the easiest workaround for now is to use fewer lagtimes or fewer states. I think this bug tends to happen more at longer lagtimes or with more states, but I'm not 100% sure.
Hmmm. I've just tried halving the number of states from ~1700 to 800, and this is what I'm getting now:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/MSMLib.py:592: RuntimeWarning: invalid value encountered in double_scalars
logger.info("Selected component %d with population %f", ComponentInd, ComponentPops[ComponentInd] / ComponentPops.sum())
10:17:31 - Selected component 0 with population nan
10:17:31 - Calculating implied timescales at lagtime 15
Traceback (most recent call last):
File "/usr/local/bin/CalculateImpliedTimescales.py", line 5, in <module>
pkg_resources.run_script('msmbuilder==2.7.dev', 'CalculateImpliedTimescales.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 82, in <module>
(not args.notrim), args.symmetrize, args.procs)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 64, in run
trimming=trimming, symmetrize=symmetrize, n_procs=nProc)
File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/msm_analysis.py", line 185, in get_implied_timescales
lags = result.get(999999)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: invalid index
Line 5 of this output ("Selected component 0 with population nan") worries me the most. Further decreasing the number of states does not seem to solve the problem, although I haven't seen any outright crashes yet.
Never mind; I think I've found a solution. The script only works when the sampled lag times are multiples of the stride (50 in my case). Any other lag time either crashes the script or raises "IndexError: invalid index".
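A minimal sketch of that workaround, assuming the script builds its lag-time list like `range(min, max, interval)` (this matches the list printed in the first log for `-l 1,100 -i 5`, but I have not checked the source). The helper `lag_times_on_stride` and its parameters are hypothetical and not part of MSMBuilder:

```python
def lag_times_on_stride(min_lag, max_lag, stride):
    """Return the lag times in [min_lag, max_lag) that are multiples of stride."""
    # Round min_lag up to the nearest multiple of stride (never below one stride).
    first = max(stride, -(-min_lag // stride) * stride)
    return list(range(first, max_lag, stride))


# Lag times from the failing run (-l 1,100 -i 5): none is a multiple of 50.
failing = list(range(1, 100, 5))
off_stride = [lag for lag in failing if lag % 50 != 0]

# Lag times that stay on a stride of 50 frames.
safe = lag_times_on_stride(1, 500, 50)
print(safe)
```

With the options from the original command, this would presumably correspond to something like `-l 50,500 -i 50`, although whether the upper bound is inclusive is not clear from the log.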
While I've found a solution, I'm not sure that I understand the philosophy behind it...
OK, I think I know what's going on. It's actually not possible to extend a hierarchical clustering to data sampled more frequently than the stride used during clustering, so lag times that are not multiples of that stride cannot be used. This is because there is no concept of "generator" or "cluster center" in hierarchical clustering.
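To illustrate why off-stride lag times fail, here is a small NumPy sketch. It assumes that frames skipped by the stride are marked as unassigned with `-1` in the assignments array, which is an assumption about the file contents and not something shown in the logs above. `count_transitions` is a hypothetical helper, not the MSMBuilder routine:

```python
import numpy as np


def count_transitions(assignments, lag):
    """Count i -> j pairs separated by `lag` frames, skipping unassigned (-1) frames."""
    start, end = assignments[:-lag], assignments[lag:]
    valid = (start >= 0) & (end >= 0)  # both ends of the pair must be assigned
    n_states = assignments.max() + 1
    counts = np.zeros((n_states, n_states), dtype=int)
    np.add.at(counts, (start[valid], end[valid]), 1)
    return counts


stride = 50
rng = np.random.default_rng(0)

# One toy trajectory of 1000 frames where only every 50th frame has a state.
assignments = np.full(1000, -1)
n_assigned = len(assignments[::stride])
assignments[::stride] = rng.integers(0, 3, size=n_assigned)

on_stride = count_transitions(assignments, 50).sum()   # lag is a multiple of the stride
off_stride = count_transitions(assignments, 15).sum()  # lag is not a multiple of the stride
print(on_stride, off_stride)
```

At a lag of 15 frames no pair has both ends assigned, so the count matrix is entirely zero. Normalizing by a total of zero would be consistent with the "population nan" message in the log, though I have not traced that through the code.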
For k-centers, k-medoids, and hybrid, there IS the concept of a generator, which allows you to transfer (or apply) your clustering to new data.
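As a rough sketch of what a generator makes possible: any new frame can be assigned to its nearest generator, whether or not it was part of the original clustering. The example uses Euclidean distance on toy 2-D points in place of RMSD between conformations, and `assign_to_generators` is a hypothetical name, not an MSMBuilder function:

```python
import numpy as np


def assign_to_generators(frames, generators):
    """Return, for each frame, the index of the closest generator."""
    # Pairwise Euclidean distances, shape (n_frames, n_generators).
    diffs = frames[:, None, :] - generators[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.argmin(axis=1)


# Two generators picked by some earlier (possibly strided) clustering run.
generators = np.array([[0.0, 0.0], [5.0, 5.0]])

# Frames that were never seen during clustering can still be labeled.
new_frames = np.array([[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]])
labels = assign_to_generators(new_frames, generators)
print(labels)
```

Hierarchical clustering only produces a grouping of the frames it was given, so there is no comparable set of centers to measure new frames against.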
Regardless, you should get a more descriptive error here, which we will fix.
Hmmmm. Some food for thought.
Thanks for the help.