
soft clustering

bendichter opened this issue · 10 comments

Beth Buffalo's lab made an interesting request for representing spike times. They use a spike sorter that does not make a hard classification of units, but rather a soft clustering that assigns each spike a probability of belonging to one or more clusters. I've been trying to think through the different options for how to store this data and I'm interested in what the rest of you think. One approach (a) could be to use the units table and add a "probability" column, e.g.

| unit | spike_times | probability |
| --- | --- | --- |
| 0 | [1.1, 2.2, 3.3] | [0.1, 0.5, 0.1] |
| 1 | [1.1, 2.2, 3.3, 4.4] | [0.9, 0.5, 0.9, 1.0] |
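For concreteness, here is a minimal sketch of option (a) with pynwb's existing Units machinery (the file setup is placeholder metadata; the "probability" column name is the proposal above):

```python
from datetime import datetime, timezone
from pynwb import NWBFile

# minimal file just to demonstrate the column; session metadata is placeholder
nwbfile = NWBFile(
    session_description='soft clustering sketch',
    identifier='demo',
    session_start_time=datetime(2019, 9, 2, tzinfo=timezone.utc),
)

# declare a ragged "probability" column that parallels spike_times
nwbfile.add_unit_column(
    name='probability',
    description='probability that each spike belongs to this unit',
    index=True,
)

# one probability per spike time; shared spikes are duplicated across units
nwbfile.add_unit(spike_times=[1.1, 2.2, 3.3], probability=[0.1, 0.5, 0.1])
nwbfile.add_unit(spike_times=[1.1, 2.2, 3.3, 4.4], probability=[0.9, 0.5, 0.9, 1.0])
```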

The advantage of this approach is that it stores the data very similarly to how we already store spike times and would not require an extension. The downside is that it requires duplication of the float value for each putative unit. Another approach (b) would be:

| time | unit_ids | probabilities |
| --- | --- | --- |
| 1.1 | [0, 1] | [0.1, 0.9] |
| 2.2 | [0, 1] | [0.5, 0.5] |
| 3.3 | [0, 1] | [0.1, 0.9] |
| 4.4 | [1] | [1.0] |

This would be more similar to the UnitSeries that has been under discussion.
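A rough sketch of option (b) as a generic, spike-level DynamicTable (the table and column names here are illustrative, not an existing NWB type):

```python
from hdmf.common import DynamicTable

# hypothetical spike-level table; one row per detected spike
spikes = DynamicTable(name='soft_spikes', description='one row per detected spike')
spikes.add_column(name='time', description='spike time (s)')
spikes.add_column(name='unit_ids', description='candidate unit ids', index=True)
spikes.add_column(name='probabilities', description='per-candidate probabilities', index=True)

spikes.add_row(time=1.1, unit_ids=[0, 1], probabilities=[0.1, 0.9])
spikes.add_row(time=2.2, unit_ids=[0, 1], probabilities=[0.5, 0.5])
spikes.add_row(time=3.3, unit_ids=[0, 1], probabilities=[0.1, 0.9])
spikes.add_row(time=4.4, unit_ids=[1], probabilities=[1.0])
```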

@rly thoughts?

edit:

After talking with the SpikeInterface team, I think their proposed solution might be ideal for us as well. Here each spike time is assigned only to the cluster of maximum probability. One nice feature of this approach is that a user who is naive to the probability system simply gets the maximum-probability cluster as the default assignment, which is a good default mode (c):

| unit | spike_times | possible_clusters | probability |
| --- | --- | --- | --- |
| 0 | [2.2] | [(0, 1)] | [(0.5, 0.5)] |
| 1 | [1.1, 3.3, 4.4] | [(0, 1), (0, 1), (1,)] | [(0.1, 0.9), (0.1, 0.9), (1.0,)] |

The downside is that it would require us to support doubly ragged arrays.

bendichter · Sep 02 '19 13:09

Options (a) and (b) seem to optimize orthogonal access patterns: (a) lets you easily iterate over units, while (b) lets you iterate over spike times. The question is which of the two access patterns is more common. I agree that the problem with both (a) and (b) is that it is up to the user to determine which assignment to actually use based on the probabilities. In that sense, I think option (c) is closer to the main practical use case, where most users would look at the highest-probability assignment and only consult the actual probabilities in special cases (e.g., ambiguities). Requiring doubly ragged arrays gets messy, though. With all that in mind, let me propose some slight variants on your proposal:

Option (d): Given that a user would only use the probabilities in special cases, the question in my mind is whether, rather than putting all the data into one table, it may be more intuitive to separate the data. E.g., you could have a mix of options (a) and (b). A compromise may be to store the probabilities in a structure like option (b) (which, my guess is, is closest to how the spike sorter creates the data). Then, in the units table you would store only the highest-probability times (as in option (c)). The VectorData dataset portion of the ragged spike_times column could be shared between the units table and the structure from (b) (which would avoid duplicate storage and ensure consistent ordering between both structures). You could then have an additional column to index (b) from the units table (although, if the time columns are shared, the index of that column would in actuality be the same as the VectorIndex of the spike_times column).
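To make the last point concrete, here is a rough, hand-built sketch of an extra column that indexes the spike-level table from option (b), using hdmf's table-reference machinery (all names are placeholders; in practice these columns would be attached to the units table rather than built free-standing):

```python
from hdmf.common import DynamicTable, DynamicTableRegion, VectorIndex

# `spikes` stands in for the spike-level table from the option (b) sketch above
spikes = DynamicTable(name='soft_spikes', description='one row per detected spike')
spikes.add_column(name='time', description='spike time (s)')
for t in (1.1, 2.2, 3.3, 4.4):
    spikes.add_row(time=t)

# ragged column of row references from the units table into `spikes`:
# unit 0 -> spike rows [0, 1, 2], unit 1 -> spike rows [0, 1, 2, 3]
spike_refs = DynamicTableRegion(
    name='soft_spikes',
    data=[0, 1, 2, 0, 1, 2, 3],
    description='rows of the spike-level table belonging to this unit',
    table=spikes,
)
spike_refs_index = VectorIndex(name='soft_spikes_index', data=[3, 7], target=spike_refs)
```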

Option (a-V2): I think option (a) with a slight modification may also be useful. I.e., instead of storing all times in the spike_times column, you would store only the highest-probability times in spike_times, as in option (c). You could then add another column, all_spike_times, that stores all possible times. In this way you use the units table as usual and look at the probabilities and all times only when necessary. I.e.:

| unit | spike_times | all_spike_times | probability |
| --- | --- | --- | --- |
| 0 | [2.2] | [1.1, 2.2, 3.3] | [0.1, 0.5, 0.1] |
| 1 | [1.1, 3.3, 4.4] | [1.1, 2.2, 3.3, 4.4] | [0.9, 0.5, 0.9, 1.0] |

As spike_times and all_spike_times are ragged columns over the same time data, they should be able to share the same VectorData and only use different VectorIndex objects. In that sense, you mentioned that option (a) "requires duplication of the float value for each putative unit", which I assume refers to the spike times. Just for additional specificity, I don't think this actually duplicates float values; rather, you would have duplicate integer indices in the VectorIndex.
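As a sketch of the shared-VectorData layout for (a-V2), built by hand with hdmf's column classes (names and offsets are assumptions):

```python
from hdmf.common import VectorData, VectorIndex

# shared, flat spike-time data: unit 0 owns rows 0..2, unit 1 owns rows 3..6
all_times_data = VectorData(
    name='all_spike_times',
    description='all candidate spike times, concatenated across units',
    data=[1.1, 2.2, 3.3, 1.1, 2.2, 3.3, 4.4],
)

# ragged index for the all_spike_times column: each entry is the end offset
# of that unit's span in the shared data (unit 0 -> [0:3], unit 1 -> [3:7])
all_times_index = VectorIndex(
    name='all_spike_times_index',
    data=[3, 7],
    target=all_times_data,
)

# caveat: a plain VectorIndex selects one contiguous span per row, so the
# highest-probability subset (e.g. unit 0's [2.2], row 1 of the shared data)
# cannot reuse this VectorData through a second VectorIndex alone
```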

I think this topic will require some further discussion, but currently I would lean towards (a-V2).

Question: would the probabilities normally be a dense matrix of #times x #units? I.e., are the probabilities ragged because the probability for most units is 0 (i.e., the matrix is sparse), and you don't want to store all the extra 0s?

oruebel · Sep 03 '19 19:09

Thanks for the thoughtful reply, Oliver! For (a-V2), I don't think a VectorIndex could pull out the right values, because it would need to select discontinuous regions of all_spike_times. (d) sounds interesting, but I'm not sure I follow entirely. In other places we have discussed getting around a doubly ragged array by using table references. Is that the idea here?

bendichter · Sep 18 '19 13:09

@oruebel

> Question: would the probabilities normally be a dense matrix of #times x #units? I.e., are the probabilities ragged because the probability for most units is 0 (i.e., the matrix is sparse), and you don't want to store all the extra 0s?

Yes, for sure that matrix would often be sparse. In many of these cases you have multiple tetrodes, each with 2-5 possible clusters, but a cell picked up by one tetrode can only possibly be in that tetrode's clusters (this is the case for the Buffalo Lab). With Neuropixels you have many densely packed electrodes tiling the space, so the clusters tile the space as well; most electrodes won't pick up any given cluster, but you won't have these discrete per-tetrode cluster groups.

I am guessing your point is that this would all be easier if we just treat it as a matrix and use sparse matrix tools in HDF5 to store it efficiently? Sounds like an interesting avenue to explore.

bendichter · Sep 26 '19 13:09

So in that case you could have a table column (e.g. "probability") whose first dim is ragged over #unit_spike_times and whose second dim is a non-ragged #all_clusters. Something like:

| unit | spike_times | probability |
| --- | --- | --- |
| 0 | [2.2, 5.0] | [[0.5, 0.5, 0, 0, 0], [0.9, 0.1, 0, 0, 0]] |
| 1 | [1.1, 3.3, 4.4] | [[0, 0.9, 0.1, 0, 0], [0, 0.9, 0.1, 0, 0], [0.1, 0.6, 0.1, 0.2, 0]] |

which on disk would be stored as:

| unit | spike_times_index | spike_times | probability_index | probability |
| --- | --- | --- | --- | --- |
| 0 | 2 | 2.2 | 2 | [0.5, 0.5, 0, 0, 0] |
| 1 | 5 | 5.0 | 5 | [0.9, 0.1, 0, 0, 0] |
|  |  | 1.1 |  | [0, 0.9, 0.1, 0, 0] |
|  |  | 3.3 |  | [0, 0.9, 0.1, 0, 0] |
|  |  | 4.4 |  | [0.1, 0.6, 0.1, 0.2, 0] |

Can we somehow share the index column so we don't need two identical ones?
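A sketch of writing that layout with the existing units API, building a standalone Units table for brevity (the "probability" column name follows the tables above; the zero padding is illustrative):

```python
from pynwb.misc import Units

units = Units(name='units', description='soft-clustered units (sketch)')

# ragged over spikes (index=True); each element is a dense length-5 vector over clusters
units.add_column(
    name='probability',
    description='per-spike probability over all clusters (n_spikes x n_clusters)',
    index=True,
)

units.add_unit(spike_times=[2.2, 5.0],
               probability=[[0.5, 0.5, 0.0, 0.0, 0.0],
                            [0.9, 0.1, 0.0, 0.0, 0.0]])
units.add_unit(spike_times=[1.1, 3.3, 4.4],
               probability=[[0.0, 0.9, 0.1, 0.0, 0.0],
                            [0.0, 0.9, 0.1, 0.0, 0.0],
                            [0.1, 0.6, 0.1, 0.2, 0.0]])
```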

bendichter · Sep 26 '19 13:09

@ajtritt what would be the best way to send instructions to a particular DynamicTable column to be stored sparsely in HDF5?

bendichter · Oct 02 '19 19:10

Can you clarify what you mean by “sparsely”?

ajtritt · Oct 02 '19 20:10

@ajtritt I expect there to be a lot of 0s in the probability array, and it would be nice if there were a way to not store them and use a fillvalue of 0 instead.

bendichter · Oct 02 '19 20:10

Yeah, you can just pass fillvalue=0 through the same mechanism I suggested here and you implemented here.
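For reference, a minimal sketch of passing a fill value via hdmf's H5DataIO wrapper, assuming that is the mechanism the links above refer to (the data here is a placeholder):

```python
from hdmf.backends.hdf5.h5_utils import H5DataIO  # also re-exported by pynwb

# placeholder dense probability data
probability_values = [[0.5, 0.5, 0.0], [0.0, 0.9, 0.1]]

# wrap the column data so the HDF5 dataset is chunked and created with fillvalue=0
wrapped = H5DataIO(data=probability_values, chunks=True, fillvalue=0)
```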

ajtritt · Oct 02 '19 20:10

I could add this as the default behavior of columns in dynamic tables, but how would I tell HDF5 to use the fill value and not write every datapoint in the array?

bendichter · Oct 02 '19 20:10

Ah, I see what you're asking. You would need to use an AbstractDataChunkIterator to return just the chunks you want to write. My understanding is that HDF5 only writes the fillvalue when it writes a chunk.
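A rough sketch of that pattern, with the class name, row-wise chunking, and variable names invented here: the iterator yields only rows containing non-zero probabilities, so chunks that are never written fall back to the fill value.

```python
import numpy as np
from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

class SparseProbabilityIterator(AbstractDataChunkIterator):
    """Yield only the rows of a dense (n_spikes, n_clusters) probability matrix
    that contain non-zero values; illustrative sketch, not an existing class."""

    def __init__(self, probabilities):
        self._data = np.asarray(probabilities)
        self._rows = iter(np.flatnonzero(self._data.any(axis=1)))  # non-zero row indices

    def __iter__(self):
        return self

    def __next__(self):
        i = next(self._rows)  # raises StopIteration once all non-zero rows are written
        return DataChunk(data=self._data[i:i + 1, :], selection=np.s_[i:i + 1, :])

    @property
    def dtype(self):
        return self._data.dtype

    @property
    def maxshape(self):
        return self._data.shape

    def recommended_chunk_shape(self):
        return (1, self._data.shape[1])  # one matrix row per HDF5 chunk

    def recommended_data_shape(self):
        return self._data.shape
```

The iterator could then be wrapped as H5DataIO(data=SparseProbabilityIterator(dense), fillvalue=0) so the chunking options and the fill value travel with the data.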

ajtritt · Oct 02 '19 21:10