nwb-schema combine SpikeEventSeries and Clustering

2) Feature Request

How would people feel about folding Clustering into SpikeEventSeries by removing Clustering and adding a SpikeEventSeries.cluster field of dtype int? We could make data optional, in case a user only wants to store the times and not the waveforms (this is pretty common).

This would:

Make the relationship between a spike time and the spike waveform more explicit
Remove redundant timestamps that is currently duplicated across the two objects
Resolve the issue that Clustering should have an electrodes field (https://github.com/NeurodataWithoutBorders/nwb-schema/issues/194)
Resolve the issue that Clustering should be a TimeSeries(https://github.com/NeurodataWithoutBorders/nwb-schema/issues/112)

Checklist

[x] Have you ensured the feature or change was not already reported ?
[x] Have you included a brief and descriptive title?
[x] Have you included a clear description of the problem you are trying to solve?
[x] Have you included a minimal code snippet that reproduces the issue you are encountering?

Oct 16 '18 00:10 bendichter

The only thing I am unsure of is making SpikeEventSeries.data optional. If a SpikeEventSeries doesn't have data, then it fails to meet the requirements of a TimeSeries.

Oct 16 '18 00:10 ajtritt

Would it be bad form to overwrite this?

i.e.

name: SpikeEventSeries:
    - datasets:
        - name: data
          quantity: '?'

...

is that allowed?

Oct 16 '18 01:10 bendichter

It shouldn’t be. What’s the point of inheritance if you can just toss out everything you inherit?

Oct 16 '18 02:10 ajtritt

I agree, making a required field optional should not be allowed in inheritance, if only because it breaks upstream behavior. A TimeSeries without data really isn't a TimeSeries. Do we maybe need a separate type to handle the timestamps-only case?

Oct 16 '18 05:10 oruebel

What if we just make a new subtype of the base TimeSeries where data stores cluster ids?

Oct 16 '18 06:10 ajtritt

@ajtritt yeah that almost works, but I could see cases where a user might want to store waveforms but does not want to cluster them, so I think ideally both would be optional. I like the idea of having a time-only type, @oruebel. I think there are other cases where this would be useful, like instantaneous events.

Oct 16 '18 06:10 bendichter

Yeah, we would keep SpikeEventSeries, and just add a new type.

Oct 16 '18 13:10 ajtritt

I'm all for removing 'Clustering'. I've been thinking of Clustering+SpikeEventSeries as being superseded by the new Unit table (for Unit metadata including clustering metrics) and UnitTimes (for spike times of each clustered unit), and expecting that all that older stuff would soon be deprecated. We've been using Unit + UnitTimes to store the output of clustering.

I think this use of Unit + UnitTimes handles the case where there is no per-spike waveform data. So we could keep SpikeEventSeries and add the '.cluster' field as proposed, and then only use it when storing waveforms. I think this would mean we don't have to worry about the no-data case discussed above.

Oct 16 '18 14:10 tjd2002

This would also resolve #111, which requests additional fields be added to Clustering.

Oct 16 '18 14:10 tjd2002

@tjd2002 I'm glad to hear you are using these structures in your lab and they are working for you! It sounds like you are working with a conversion script that goes from some acquisition format and converts to NWB. This is where we'd like labs to be right now, so that's great.

Eventually we would like to work with acquisition groups to automatically save data as NWB files, so no conversion is necessary. I think the Clustering datatype would be useful for this because you can just append to the dataset as you go, whereas the UnitTimes structure might be a bit complicated (I imagine inserting data into the vector would be difficult during streaming and that appending to the end would be much easier, but I could be wrong about this). The "no-data case" discussed above was for cases like streaming live data where you may want to record times and clusters in real time but you might not want to save waveforms. Do you think that use-case might come up?

Oct 16 '18 21:10 bendichter

My issue with the older "Clustering+ClusterWaveforms+SpikeEventSeries" was that it was all tied pretty closely to a particular clustering workflow (I think it was KlustaKwik?), and folks also found it was not likely to be performant for larger numbers of units. I see you have proposed some changes to Clustering to address those limitations, but the Unit metadata table seemed like a better all-around solution (more easily extensible, e.g.).

I think it's very confusing to have 2 different facilities (Clustering and friends Vs. Unit/UnitTimes) for the same type of data. If possible it would be great to optimize the storage for streaming without requiring this duplication. Would it be possible to just overload the UnitTimes I/O (by creating something like StreamingUnitTimes which is optimized for appending?) to solve the streaming issue, and keep everything in the nice Unit framework?

For EventWaveforms, I like your suggestion elsewhere to add in a '.cluster' property. This seems like it could handle both the streaming, and the off-line use cases?

Oct 17 '18 18:10 tjd2002

Conversation seems to be continuing over at #194, in case anyone else is following along.

Jan 17 '19 02:01 tjd2002

Closing because Clustering has been deprecated.

May 13 '24 17:05 stephprince