msmbuilder-legacy

Update tICA docs

Open kyleabeauchamp opened this issue 11 years ago • 26 comments

So there's a file docs/tICA/tICA.pdf that shares a lot in common with http://msmbuilder.s3-website-us-east-1.amazonaws.com/theory/tICA.html

However, there are several key differences:

  1. HTML version is missing the actual MSMBuilder commands

I also noticed that the PDF version instructs users to download a special fork and branch, which I believe is no longer necessary, right?

kyleabeauchamp avatar Feb 05 '14 16:02 kyleabeauchamp

OK I guess what happened was that Robert split the tICA doc into theory and application but never ported the application side of things.

I wonder if it makes more sense to merge them. Not sure.

kyleabeauchamp avatar Feb 05 '14 16:02 kyleabeauchamp

Do you have a preferred way to generate the atom pairs? Otherwise, I vote that we include a tiny script inside the tutorial (probably at the very end, in a FAQ):

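import itertools
import numpy as np
import mdtraj as md

# Load the reference structure; only its topology is needed here.
trj = md.load("./system.subset.pdb")

# Select the alpha carbons of residues 139 through 175.
top, bonds = trj.top.to_dataframe()
atom_indices = np.where((top.name == "CA") & (top.resSeq >= 139) & (top.resSeq <= 175))[0]

# Enumerate every unique pair of selected atoms.
atom_pairs = list(itertools.combinations(atom_indices, 2))

# Write one pair of integer atom indices per line.
np.savetxt("./AtomPairs.dat", atom_pairs, "%d")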

kyleabeauchamp avatar Feb 05 '14 16:02 kyleabeauchamp
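
As a quick sanity check on the script above, the number of atom pairs grows quadratically with the number of selected atoms. The following is a minimal sketch, assuming exactly one CA atom per residue for resSeq 139 through 175 (that atom count is an assumption about the system, not something stated in this thread); the stand-in indices replace the ones the real script reads from the topology.

```python
import itertools

# Assumption: one CA atom per residue for resSeq 139..175 inclusive.
n_atoms = 175 - 139 + 1  # 37

# Stand-in indices; the real script takes them from the topology.
atom_indices = range(n_atoms)
atom_pairs = list(itertools.combinations(atom_indices, 2))

# "n choose 2" unique pairs, i.e. one distance feature per pair and frame.
n_pairs = len(atom_pairs)
print(n_pairs)  # 666
```

Under that assumption, each frame would be described by 666 pairwise distances, which is the dimensionality tICA then has to reduce.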

Also, I wonder if it might make sense to set stride = 1 for the "beginner" tICA tutorial. Otherwise there are just a lot of parameters exposed to users.

I know you've found the 1 / 10 stride optimal, but I wonder if it makes sense to ignore that fact for illustrative purposes.

kyleabeauchamp avatar Feb 05 '14 16:02 kyleabeauchamp

Also: the "Drawbacks of TICA" section actually applies to all MSM-like forms of dimensionality reduction, so it might not be necessary.

kyleabeauchamp avatar Feb 05 '14 17:02 kyleabeauchamp

thanks for checking this out, I'm going to make these changes (hopefully later today)

schwancr avatar Feb 05 '14 17:02 schwancr

Also: IMHO eliminate the ProjectInfo.yaml inputs to scripts, as they get set by default. I think that will help people stay focused on the tICA-specific details.

kyleabeauchamp avatar Feb 05 '14 17:02 kyleabeauchamp

If you do end up merging the theory + applications into a single sphinx file, could you move the link to sit under the "Documentation" section on the main docs page?

kyleabeauchamp avatar Feb 05 '14 18:02 kyleabeauchamp

Again, let me know if you run out of time and I can file a PR for some of this stuff.

kyleabeauchamp avatar Feb 05 '14 18:02 kyleabeauchamp

The reason I split them was because the original tICA tutorial latex file I was working from had all this info about downloading a different branch, and I wasn't sure which parts were or were not relevant currently. But the theory I knew was current.

I don't think having the theory and practice pages separated is a bad idea, especially if we can put in links between them.

rmcgibbo avatar Feb 05 '14 19:02 rmcgibbo

So should I use k-centers or k-medoids when clustering my tICA results? Because we're working in the eigenvector space, I imagine that either one of the following is true:

  1. k-centers neglects equilibrium density because we're working in right eigenvector space
  2. k-medoids double-counts equilibrium density because we're working in left eigenvector space

kyleabeauchamp avatar Feb 06 '14 18:02 kyleabeauchamp

In practice, I've found that the hybrid k-medoids that msmb implements doesn't change things drastically. If you wanted to do k-means, however, you could probably gain a lot.

I don't know the rigorously right way to do it, but for instance, when I used Ward clustering, I could build a 20 state model (just from clustering) that gave me the same model (with slightly faster timescales) as building a 1,000 state model with k-centers.

schwancr avatar Feb 06 '14 19:02 schwancr
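
One way to see why k-centers might neglect density, as raised in the question above, is a farthest-point sketch on toy data. This is a generic NumPy illustration, not MSMBuilder's implementation; the `kcenters` helper, the toy point clouds, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def kcenters(X, k, seed_index=0):
    """Farthest-point k-centers; returns the indices of the chosen centers."""
    centers = [seed_index]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(X - X[seed_index], axis=1)
    for _ in range(k - 1):
        new_center = int(np.argmax(dist))  # the farthest point becomes a center
        centers.append(new_center)
        dist = np.minimum(dist, np.linalg.norm(X - X[new_center], axis=1))
    return np.array(centers)

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(1000, 2))  # heavily sampled basin
sparse = rng.uniform(3.0, 6.0, size=(10, 2))  # rarely visited region
X = np.vstack([dense, sparse])

centers = kcenters(X, k=5)
# Rows 1000 and above belong to the sparse region.
n_sparse_centers = int(np.sum(centers >= 1000))
print(n_sparse_centers, "of 5 centers fall in the sparse region")
```

Because each new center is simply the point farthest from the existing ones, centers are placed to cover space rather than to follow where the samples concentrate.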

Thanks. Have you ever done k-means? E.g. do I have to update Cluster.py?

kyleabeauchamp avatar Feb 06 '14 19:02 kyleabeauchamp

I think I answered my own question

Cluster.py tica atompairs: error: argument alg: invalid choice: 'kmeans' (choose from 'kcenters', 'hybrid', 'clarans', 'sclarans', 'hierarchical')

kyleabeauchamp avatar Feb 06 '14 19:02 kyleabeauchamp

We don't have k-means in msmbuilder. I think I tried it once on my own with scikit-learn though

schwancr avatar Feb 06 '14 19:02 schwancr

We do have a commented-out KMeans class in clustering.py...

kyleabeauchamp avatar Feb 06 '14 19:02 kyleabeauchamp

Does it work? I didn't know that

schwancr avatar Feb 06 '14 19:02 schwancr

IMHO it looks highly suspicious.

kyleabeauchamp avatar Feb 06 '14 19:02 kyleabeauchamp

If we want kmeans, we should definitely just wrap sklearn.

kyleabeauchamp avatar Feb 06 '14 19:02 kyleabeauchamp
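
A wrapper along those lines could be quite thin. The sketch below is hypothetical: the `KMeansClusterer` class and its `fit` / `assign` methods are invented for illustration and are not part of MSMBuilder; only `sklearn.cluster.KMeans` and its `fit` and `predict` methods are real scikit-learn API. It assumes each trajectory has already been projected into a 2D array of shape (n_frames, n_features).

```python
import numpy as np
from sklearn.cluster import KMeans

class KMeansClusterer:
    """Hypothetical thin wrapper around scikit-learn; not MSMBuilder API."""

    def __init__(self, n_clusters, random_state=0):
        self._model = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)

    def fit(self, trajectories):
        # Pool all frames, since k-means has no notion of separate trajectories.
        self._model.fit(np.concatenate(trajectories))
        return self

    def assign(self, trajectories):
        # Return one array of state labels per input trajectory.
        return [self._model.predict(t) for t in trajectories]

# Toy data: two well-separated "trajectories" standing in for tICA projections.
rng = np.random.default_rng(0)
traj_a = rng.normal(0.0, 0.1, size=(50, 2))
traj_b = rng.normal(10.0, 0.1, size=(40, 2))

clusterer = KMeansClusterer(n_clusters=2).fit([traj_a, traj_b])
assign_a, assign_b = clusterer.assign([traj_a, traj_b])
print(len(assign_a), len(assign_b))
```

Note that this pools every frame in memory before fitting, so it does not address memory use for large data sets.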

Hey our tICA pipeline is 100% streaming, which is a huge memory advantage over the previous RMSD-based pipeline. This is a huge win that we should advertise.

kyleabeauchamp avatar Feb 06 '14 20:02 kyleabeauchamp

Yeah, though the clusterer itself isn't streaming. It can load and project things in a streaming fashion, so you can still gain a lot.

schwancr avatar Feb 06 '14 20:02 schwancr
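
To illustrate why the load / project side can be streamed: tICA only needs a covariance matrix and a time-lagged correlation matrix, and both can be accumulated one trajectory at a time. The following is a minimal NumPy sketch under stated assumptions, not MSMBuilder's code: `streaming_tica_matrices` and `fake_loader` are hypothetical names, the features are assumed to be mean-centered already, and random data stands in for trajectories loaded from disk.

```python
import numpy as np

def streaming_tica_matrices(trajectories, lag):
    """Accumulate C(0) and the symmetrized C(lag) one trajectory at a time.

    Assumes mean-centered features, lag >= 1, and more than `lag` frames
    in every trajectory.
    """
    c0 = ct = None
    n0 = nt = 0
    for X in trajectories:  # only one trajectory is held in memory at a time
        if c0 is None:
            n_features = X.shape[1]
            c0 = np.zeros((n_features, n_features))
            ct = np.zeros((n_features, n_features))
        c0 += X.T @ X
        n0 += len(X)
        ct += X[:-lag].T @ X[lag:]  # pairs of frames separated by `lag`
        nt += len(X) - lag
    ct = 0.5 * (ct + ct.T)  # symmetrize the lagged matrix
    return c0 / n0, ct / nt

def fake_loader(n_trajectories, n_frames, n_features, seed=0):
    """Stand-in generator for reading trajectories from disk one by one."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trajectories):
        yield rng.normal(size=(n_frames, n_features))

c0, ct = streaming_tica_matrices(fake_loader(3, 200, 4), lag=5)
print(c0.shape, ct.shape)
```

Memory use here scales with the number of features squared plus a single trajectory, rather than with the full data set. The tICA components would then come from the generalized eigenproblem on these two small matrices.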

~~We probably also want an easy way / tutorial to calculate the tICA projections of each trajectory.~~

Edit: moved this request to a separate issue, as it's not docs related.

kyleabeauchamp avatar Feb 07 '14 23:02 kyleabeauchamp

Also: need to cite your paper in the tutorial. Probably would also be nice to cite Frank's tICA paper as well.

kyleabeauchamp avatar Feb 07 '14 23:02 kyleabeauchamp

OK looks like Robert has cited you in the HTML tICA theory guide, so that's already done.

kyleabeauchamp avatar Feb 07 '14 23:02 kyleabeauchamp

OK, we also need to modify the commands due to the changed tICA load interface.

kyleabeauchamp avatar Feb 13 '14 02:02 kyleabeauchamp

Do you want to open a PR?

rmcgibbo avatar Feb 13 '14 02:02 rmcgibbo

Maybe if there is a boring talk at BPS next week.

kyleabeauchamp avatar Feb 13 '14 02:02 kyleabeauchamp