msmbuilder-legacy
Use Pandas for Assignments?
The current code for building counts looks like this:
for A in assignments:
    FirstEntry = np.where(A != -1)[0]
    # New Code by KAB to skip pre-padded negative ones.
    # This should solve issues with Tarjan trimming results.
    if len(FirstEntry) >= 1:
        FirstEntry = FirstEntry[0]
        A = A[FirstEntry:]
        C = C + get_counts_from_traj(A, n_states, lag_time=lag_time,
                                     sliding_window=sliding_window)  # .tolil()
And inside get_counts_from_traj:
try:
    C = scipy.sparse.coo_matrix((counts, transitions),
                                shape=(n_states, n_states))
except ValueError:
    # Lutz: if we arrive here, there was probably a state with index -1.
    # We try to fix it by ignoring transitions in and out of those states
    # (we set both the count and the indices for those transitions to 0).
    mask = transitions < 0
    counts[mask[0, :] | mask[1, :]] = 0
    transitions[mask] = 0
    C = scipy.sparse.coo_matrix((counts, transitions),
                                shape=(n_states, n_states))
Can you point to your favorite tutorials or documentation on the pandas Series object?
http://pandas.pydata.org/pandas-docs/dev/dsintro.html
To do this with a 1D pandas Series of assignments, it might look like this:
from_states = ass
to_states = ass.shift(-1)
to_states looks like this:
In [82]: to_states
Out[82]:
0 2518
1 2997
2 2188
...
22235 67
22236 13
22237 12
22238 NaN
Length: 22239, dtype: float64
Can you outline the operation to get counts from from_states and to_states? Since it's an ndarray subclass, you can still do np.where on it...
So here is the equivalent of transitions = np.row_stack((from_states, to_states))
transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)
In [102]: transitions.values
Out[102]:
array([[ 2195., 2518., 2997., ..., 139., 67., 13.],
[ 2518., 2997., 2188., ..., 67., 13., 12.]])
On my machine, it appears you can do transitions = np.row_stack((from_states, to_states)) directly, with from_states and to_states as Series.
One key difference is that dropna(axis=1) deals with the NA values; row_stack does not:
In [104]: row_stack((from_states, to_states))
Out[104]:
array([[ 2195., 2518., 2997., ..., 67., 13., 12.],
[ 2518., 2997., 2188., ..., 13., 12., nan]])
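To round out the answer to the counting question above, here is a minimal sketch of the whole operation on a single trajectory. The function name counts_from_series is hypothetical, and it assumes ass is a pandas Series of state indices with the old -1 padding already replaced by NaN; this is a sketch of the idea, not the library's implementation:

import numpy as np
import pandas as pd
import scipy.sparse

def counts_from_series(ass, n_states, lag_time=1):
    # pair each state with the state lag_time frames later
    from_states = ass
    to_states = ass.shift(-lag_time)

    # stacking as a DataFrame (rather than np.row_stack) lets
    # dropna(axis=1) discard any pair that touches a NaN frame
    transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)
    t = transitions.values.astype(int)

    # sliding-window counting with one count per observed transition
    counts = np.ones(t.shape[1])
    return scipy.sparse.coo_matrix((counts, (t[0], t[1])),
                                   shape=(n_states, n_states))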
Oh, I see.
Okay. I'm sold. I think assignments should be a list of Series. @schwancr ?
I realize that this does mean introducing a new dependency, but fuckit.
Comes with EPD and Anaconda, so it's not really new...
Still. Not to be taken lightly. Not everyone is using those, even if we recommend it.
What do we gain from using pandas.Series? Is it just the timestamp?
Just the timestamp and methods associated with time shifts etc. We should just play with it and see if it's useful to us. I'm still not 100% sold.
Could we instead do a single pandas array with: traj, frame, time, assignment?
Then we don't need a list of them
We could create an object like that and use DataFrame.pivot_table() to reindex things as needed.
What you gain is the index, so you're adding some semantic meaning to the difference between assignments[i] and assignments[i+1]. Like, how far is "one index unit". And then when you timeshift them and stuff, this data can be used.
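As a toy illustration of that point (the states and times here are made up):

import pandas as pd

# the index carries the physical time, so "one index unit" has meaning
ass = pd.Series([5, 7, 7, 2], index=[0.0, 0.5, 1.0, 1.5])  # time in ns
ass.shift(-1)  # the state one frame later, still labeled by the original times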
Could we instead do a single pandas array with: traj, frame, time, assignment?
huh?
You can have a data frame with arbitrary columns. So could we just store a single data frame where each entry has a timestamp, a trajectory index, and an assignment?
The idea of pivot_table is to take some "data" columns and convert them into the "index" for the object. I've used it in the context of dealing with NMR data.
For example, my NMR data is a CSV file with lists of residue ID, atom name, and value. I use pivot_table to convert this object into a single 1D series that is indexed by the tuples (residue_ID, atom_name).
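Roughly, that NMR case looks like the following sketch. The column names and values are made up for illustration, and note that newer pandas spells the keyword index= where the older API in this thread uses rows=:

import pandas as pd

df = pd.DataFrame({
    "residue_ID": [1, 1, 2, 2],
    "atom_name": ["CA", "CB", "CA", "CB"],
    "value": [55.1, 30.2, 54.8, 29.9],
})

# convert the "data" columns into the index: a 1D series
# indexed by (residue_ID, atom_name) tuples
series = df.pivot_table(index=["residue_ID", "atom_name"])["value"]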
We can then combine this with a groupby operation that will allow us to iterate over the trajectories one by one, even though the actual DataFrame is not indexed by trajectory.
I def think this is worth trying.
groupby is the key here.
Someone should just try to implement something. Then we can see how it works, for realz.
I am warming to @schwancr's idea that it's an n_snapshots x 3 table where the columns are 'time', 'trajectory_index', 'state_assignment'. That way, if you imagine situations where you are, say, comparing two MSMs and you have a dual state decomposition, you could just add a fourth column. I think it's more flexible.
And having the pd.DataFrame, so that these columns are labeled with titles -- that's critical, since each column is semantically so different.
Here's a half-working prototype. I still need to pivot the ass_i to use the timestamps as index, but it's otherwise working:
data = []
num_traj, traj_length = a.shape
for i in range(num_traj):
    for j, state in enumerate(a[i]):
        if state == -1:
            state = np.nan
        data.append([i, j, state])

X = pd.DataFrame(data, columns=["traj", "time", "state"])
X["state"]

for (k, ass_i) in X.groupby("traj"):
    print(ass_i["state"])
0 NaN
1 2195
2 2518
3 2997
4 2188
...
22237 67
22238 13
22239 12
Name: state, Length: 22240, dtype: float64
22240 NaN
22241 NaN
22242 NaN
22243 NaN
22244 715
22245 2250
...
44477 NaN
44478 NaN
44479 NaN
Name: state, Length: 22240, dtype: float64
OK, so one problem with my current approach is that integer series are incompatible with NaNs (same as NumPy). Thus, we can't just throw NaNs in.
I don't think this should be a problem, though.
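For the record, a quick illustration of the upcast with toy values (pandas has, at least historically, silently promoted to float64 here, which is also why the outputs above show dtype: float64):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])   # dtype: int64
s[1] = np.nan              # no error, but the series is upcast...
print(s.dtype)             # float64, since int64 cannot hold NaN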
Here's the updated version to index the individual trajectory assignments by their timestamps:
X = pd.DataFrame(data, columns=["traj", "time", "state"])
X["state"]

for (k, ass_i) in X.groupby("traj"):
    # note: 'rows' is the old pivot_table keyword; newer pandas calls it 'index'
    ass_i = ass_i.pivot_table(rows=["time"])["state"]
    print(ass_i)
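One note on that pivot step: since each (traj, time) pair occurs exactly once here, pivot_table's default mean aggregation is a no-op, so a plain set_index would be an equivalent, and arguably more direct, sketch:

for (k, ass_i) in X.groupby("traj"):
    # equivalent for unique (traj, time) pairs; no implicit aggregation
    ass_i = ass_i.set_index("time")["state"]
    print(ass_i)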