msmbuilder-legacy icon indicating copy to clipboard operation
msmbuilder-legacy copied to clipboard

Use Pandas for Assignments?

Open kyleabeauchamp opened this issue 12 years ago • 39 comments

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

The current code for building counts looks like this:

    for A in assignments:
        FirstEntry = np.where(A != -1)[0]
        # New Code by KAB to skip pre-padded negative ones.
        # This should solve issues with Tarjan trimming results.
        if len(FirstEntry) >= 1:
            FirstEntry = FirstEntry[0]
            A = A[FirstEntry:]
            C = C + get_counts_from_traj(A, n_states, lag_time=lag_time, sliding_window=sliding_window)  # .tolil()

get_counts_from_traj:

    try:
        C = scipy.sparse.coo_matrix((counts, transitions),
                                    shape=(n_states, n_states))
    except ValueError:
        # Lutz: if we arrive here, there was probably a state with index -1
        # we try to fix it by ignoring transitions in and out of those states
        # (we set both the count and the indices for those transitions to 0)
        mask = transitions < 0
        counts[mask[0, :] | mask[1, :]] = 0
        transitions[mask] = 0
        C = scipy.sparse.coo_matrix((counts, transitions),
                                    shape=(n_states, n_states))

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Can you point to your favorite tutorials or documentation on the pandas Series object?

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

http://pandas.pydata.org/pandas-docs/dev/dsintro.html

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

To do this with a 1D pandas Series of assignments, it might look like this:

from_states = ass
to_states = ass.shift(-1)

to_states looks like this:

In [82]: to_states
Out[82]: 
0     2518
1     2997
2     2188
...
22235     67
22236     13
22237     12
22238    NaN
Length: 22239, dtype: float64

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Can you outline the operation to get counts from from_states and to_states? Since it's an ndarray subclass, you can still do np.where on it...

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

So here is the equivalent of transitions = np.row_stack((from_states, to_states))

transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)
In [102]: transitions.values
Out[102]: 
array([[ 2195.,  2518.,  2997., ...,   139.,    67.,    13.],
       [ 2518.,  2997.,  2188., ...,    67.,    13.,    12.]])

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

On my machine, it appears that you can exactly do transitions = np.row_stack((from_states, to_states)) with from_states and to_states as Series.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

One key change is that dropna(axis=1) deals with the NA values. Using the row_stack doesn't deal with the NA.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

In [104]: row_stack((from_states, to_states))
Out[104]: 
array([[ 2195.,  2518.,  2997., ...,    67.,    13.,    12.],
       [ 2518.,  2997.,  2188., ...,    13.,    12.,    nan]])

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Oh, I see.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

Okay. I'm sold. I think assignments should be a list of Series. @schwancr ?

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

I realize that this does mean introducing a new dependency, but fuckit.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

Comes with EPD and Anaconda, so it's not really new...

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Still. Not to be taken lightly. Not everyone is using those, even if we recommend it.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

What do we gain from using pandas.Series? Is it just the timestamp?

schwancr avatar Jun 14 '13 05:06 schwancr

Just the timestamp and methods associated with time shifts etc. We should just play with it and see if it's useful to us. I'm still not 100% sold.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Could we instead do a single pandas array with: traj, frame, time, assignment?

Then we don't need a list of them

schwancr avatar Jun 14 '13 05:06 schwancr

We could create an object like that and use DataFrame.pivot_table() to reindex things as needed.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

What you gain is the index, so you're adding some semantic meaning to the difference between assignment[i] and assignments[i+1]. Like, how far is "one index unit". And then when you timeshift them and stuff, this data can be used.

Could we instead do a single pandas array with: traj, frame, time, assignment?

huh?

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

You can have a data frame with arbitrary columns. So could we just store a single data frame where each entry has time stamp trajectory and assignment

schwancr avatar Jun 14 '13 05:06 schwancr

The idea of pivot_table is to take some "data" columns and convert them into the "index" for the object. I've used it in the context of dealing with NMR data.

For example, my NMR data is a CSV file with a lists of residue ID, atom name, and value. I use pivot_table to convert this object into a single 1D series that is indexed by the tuples (residue_ID, atom_name).

We can then combine this with a group_by operation that will allow us to iterate over the trajectories one by one--even though the actual data_frame is not indexed by trajectory.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

I def think this is worth trying.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

group_by is the key here.

kyleabeauchamp avatar Jun 14 '13 05:06 kyleabeauchamp

Someone should just try to implement something. Then we can see how it works, for realz.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

I am warming to @schwancr's idea that it's an n_snapshots x 3 table where the columns are 'time', 'trajectory_index', 'state_assignment'. That way, if you imagine situations where you are like comparing two MSMs or something and you have dual state decomposition, you could just add a fourth column. I think it's more flexible.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

And having the pd.DataFrame, so that these columns are labeled with titles -- that's critical, since each column is semantically so different.

rmcgibbo avatar Jun 14 '13 05:06 rmcgibbo

Here's a half-working prototype. I still need to pivot the ass_i to use the timestamps as index, but it's otherwise working:

data = []
num_traj, traj_length = a.shape
for i in range(num_traj):
    for j, state in enumerate(a[i]):
        if state == -1:
            state = np.nan
        data.append([i, j, state])


X = pd.DataFrame(data, columns=["traj","time","state"])
X["state"]
for (k, ass_i) in X.groupby("traj"):
    print(ass_i["state"])


0      NaN
1     2195
2     2518
3     2997
4     2188
...
22237     67
22238     13
22239     12
Name: state, Length: 22240, dtype: float64
22240     NaN
22241     NaN
22242     NaN
22243     NaN
22244     715
22245    2250
...
44477   NaN
44478   NaN
44479   NaN
Name: state, Length: 22240, dtype: float64

kyleabeauchamp avatar Jun 14 '13 06:06 kyleabeauchamp

OK, so one problem with my current approach is that integer series are incompatible with NANs (same as Numpy). Thus, we can't just throw NANs in.

kyleabeauchamp avatar Jun 14 '13 06:06 kyleabeauchamp

I don't think this should be a problem, though.

kyleabeauchamp avatar Jun 14 '13 06:06 kyleabeauchamp

Here's the updated version to index the individual trajectory assignments by their timestamps:

X = pd.DataFrame(data, columns=["traj","time","state"])
X["state"]
for (k, ass_i) in X.groupby("traj"):
    ass_i = ass_i.pivot_table(rows=["time"])["state"]
    print(ass_i)

kyleabeauchamp avatar Jun 14 '13 06:06 kyleabeauchamp