spikeinterface icon indicating copy to clipboard operation
spikeinterface copied to clipboard

Bug in extract_waveforms_to_buffers()

Open JarrodDowdall opened this issue 8 months ago • 12 comments

Using extract_waveforms_to_buffers() I noticed there is a mismatch between the number of expected waveforms and the actual number returned per unit, when the unit_ids not exactly like range(num_units).

It seems my issue is related to this piece of code in allocate_waveforms_buffers():

for unit_ind, unit_id in enumerate(unit_ids):
        n_spikes = np.sum(spikes["unit_index"] == unit_ind)

https://github.com/SpikeInterface/spikeinterface/blob/2b6e7a2ed43f946274209b0fa7d63e96e0be2f55/src/spikeinterface/core/waveform_tools.py#L183

Obviously, if unit_ids=[0,1,2,3,4] there is no issue, but if unit_ids=[0,1,13,91,101], for example, then there is an issue, because:

>>> unit_ids=[0,1,13,91,101]
>>> list(enumerate(unit_ids))
[(0, 0), (1, 1), (2, 13), (3, 91), (4, 101)]

I could be missing something about how extract_waveforms_to_buffers() is intended to be used, but to me this looks like it was a mistake.

JarrodDowdall avatar Apr 02 '25 02:04 JarrodDowdall

Hi @JarrodDowdall

Thanks for the issue!

The implementation is actually correct. The spikes array in fact uses unit_index, and not unit_id. In your example, unit_id 13 will show up as unit_index 3 in the spike vector.

You said you're seeing a mismatch in the number of expected spikes per unit. Can you give us more details?

alejoe91 avatar Apr 02 '25 08:04 alejoe91

Just to follow up on Alessio's point we explain how we use the term index and id for specific things in the docs here. If that was too hard to find, then we would love suggestions on what part of the docs you explored?

zm711 avatar Apr 02 '25 12:04 zm711

Hi @alejoe91 and @zm711

Sure, I understand there is a distinction between unit_index and unit_id.

But my question is what enforces a consistent mapping between unit_ids to unit_index in this code?

To make my point lets assume the input, unit_ids, can be arbitrarily ordered, or a subset the total units in spikes.

The problem is the code on lines 183:184 (below) makes a strict assumption about the mapping between unit_ids to unit_index:

for unit_ind, unit_id in enumerate(unit_ids):
        n_spikes = np.sum(spikes["unit_index"] == unit_ind)

Later on lines 194/204 the output is assigned according to unit_ids:

waveforms_by_units[unit_id] = arr

That is, this code explicitly assumes unit_ids[0] maps to unit_index=0, unit_ids[1] maps to unit_index=1, and so on.

Which means the output is not invariant to something as simple as shuffling the order of the input unit_ids.

Perhaps this situation doesn't occur that often in practice, but regardless it enforces the user to understand something that is unintuitive and obscure unless you have read the code.

Nevertheless, what I want to be able to do is fetch the waveforms for a subset of the units in spikes, and that is how I came across this situation.

JarrodDowdall avatar Apr 02 '25 15:04 JarrodDowdall

Nevertheless, what I want to be able to do is fetch the waveforms for a subset of the units in spikes, and that is how I came across this situation.

What code you are using for this? You should not interact with unit_index as a user, that is kind of private attribute, an internal detail of the implementation.

h-mayorquin avatar Apr 02 '25 16:04 h-mayorquin

Hi @h-mayorquin

I am using extract_waveforms_to_buffers() in src/spikeinterface/core/waveform_tools.py https://github.com/SpikeInterface/spikeinterface/blob/b8f7253b018d55c730e3cb096e17ff89339a5050/src/spikeinterface/core/waveform_tools.py#L25

But to your point, I don't have to interact with unit_index for the situation to occur. I can just fetch the spikes structured array as is, and pass a subset of the unit_ids to extract_waveforms_to_buffers().

However, maybe there is another user facing function to fetch waveforms that avoids this.

Nevertheless, I chose extract_waveforms_to_buffers() because it is exposed, whereas extract_waveforms_to_single_buffer() is not, and extract_waveforms() is depricated.

Maybe there is another method I am not aware of, in which case please let me know.

It would be great if the user was not bound to SortingAnalyzer for this, so that they could fetch the waveforms directly from the recording given a previously saved spikes array, or even fetch waveforms pre-sorting based on the output of detect_peaks().

JarrodDowdall avatar Apr 02 '25 16:04 JarrodDowdall

I think this is great info to know about what we make public vs what should be private. My normal rec would be to get the waveforms through the SortingAnalyzer, but if you want to avoid the analyzer then maybe Alessio or @samuelgarcia would be better to comment on this since I'm not sure if we have any sortingcomponents way to get waveforms/work at a lower level.

If you don't mind us asking are you trying to build your own sorter using spikeinterface/building a low-level pipeline? Most of our users like that we are providing a high-level API. So I'm also curious about people that want to work at a very low-level. One of my projects is to start better documenting some of these "public" but internal functions. Maybe if we know your overall goals we can better point you to specific functions.

zm711 avatar Apr 02 '25 17:04 zm711

It would be great if the user was not bound to SortingAnalyzer for this, so that they could fetch the waveforms directly from the recording given a previously saved spikes array, or even fetch waveforms pre-sorting based on the output of detect_peaks().

Why would you like to avoid using the SortingAnalyzer? Extracting waveforms is one of its main functionalities. I am asking in good faith, there might be things to be improved.

We should have a long discussion about what parts of the API is exposed. I think that our policy of public by default is bad for our users.

h-mayorquin avatar Apr 02 '25 17:04 h-mayorquin

I think we don't have a good option to instantiate the random spike extension with user-provided indices. @JarrodDowdall would this option enable you to use the SortingAnalyzer?

alejoe91 avatar Apr 02 '25 17:04 alejoe91

@JarrodDowdall

Instead of extract_waveforms_to_buffers() you can use extract_waveforms_to_single_buffer and thensplit_waveforms_by_units this should give the same results but less agressive for memory.

extract_waveforms_to_buffers() is quite old and not used anymore. I hope it is not buggy! It is the one used by the WaveformExtractor. One file or buffer per unit.

I am not understand the potential bug in your first comment. The unit_ind is used to select spikes. The final output is a dict of arrays and the key is the unit_id.

samuelgarcia avatar Apr 02 '25 17:04 samuelgarcia

Hi @zm711 and @h-mayorquin

I am not opposed to the high-level API, and it has its utility in simplicity. Maybe I am not a typical user, but I like the idea that a user could "exit" SpikeInterface at any point in their analyses.

I think this would be particularly useful given that neural data analysis methods are constantly in development, and it is almost certain that users will want to branch off from SpikeInterface at some stage in their analyses.

One nice aspect of this is that new analysis methods that are developed off of SpikeInterface should be easier to incorporate into SpikeInterface later should the community find it useful and request it (obviously this won't always be true, but it could be more often than not).

With regard to SortingAnalyzer, although I do like a lot about the current implementation, I do find it locks the user into SpikeInterface, and I have found it cumbersome to get the outputs from all of the extensions in a consistent way. For instance, it is especially challanging to figure out how the data in each extension is organized without digging through the code.

For what it's worth, of course not every user will want low-level functionality, or even care to leave SpikeInterface, but I think having that possibility, and making it easy for the user to do so, will increase the adoption of SpikeInterface and promote its development.

JarrodDowdall avatar Apr 02 '25 18:04 JarrodDowdall

All great points @JarrodDowdall. Thanks for that write up. I think those are all fair requests and feeds back into Heberto's comment about us finding a better way to document our private vs public functions since some of our public functions are actually internal and so really expect the inputs to be organized in a certain way that we all know and didn't necessarily intend for end-users. It is probably worth having a discussion about low-level public for people like you that want to use tooling, but don't want the high-level vs truly private methods that we really only want to be used within the framework (again as Heberto suggested).

For instance, it is especially challanging to figure out how the data in each extension is organized without digging through the code.

I absolutely agree with that. We should find a unified way to get data from everything accounting for the face that some extensions do take optional organizational arguments. (ie return as an array or as a dict). We should also work more on docs (but that is really on me--I need to pick up my efforts there).

zm711 avatar Apr 02 '25 18:04 zm711

@samuelgarcia The situation I ran into is given that for index, value in enumerate(values):

The mapping between unit_index in spikes and unit_id in unit_ids is determined by the order of unit_ids, which may or may not be consistent with the true mapping.

I did look into extract_waveforms_to_single_buffer(), but it is not exposed when using import spikeinterface.full as si.

I thought that anything exposed in spikeinterface.full was what you primarily intend to be user facing, so I try to stick to that, but there are some exceptions - even in tutorial examples.

Nevertheless, obviously it is possible to import the function with:

from spikeinterface.core.waveform_tools import extract_waveforms_to_single_buffer

But I did notice that in this function unit_ids does not appear to be used anywhere, yet it is still a required input. Setting unit_ids=None works, but it not necessarily transparent to the user.

https://github.com/SpikeInterface/spikeinterface/blob/b8f7253b018d55c730e3cb096e17ff89339a5050/src/spikeinterface/core/waveform_tools.py#L406

There is another small difference: file_path in extract_waveforms_to_single_buffer() and folder in extract_waveforms_to_buffers(), but if extract_waveforms_to_buffers() is old, and will soon be deprecated, and file_path will supersede folder, then it doesn't matter.

Lastly, and it's trivial, but all of the functions in waveform_tools.py, except split_waveforms_by_units(), have the same typo in their Docstring:

"unit_ids: list ot numpy"

https://github.com/SpikeInterface/spikeinterface/blob/e9b39ef400bce4bef9e6f5048660753ed759c331/src/spikeinterface/core/waveform_tools.py#L56

@alejoe91 for now I can make extract_waveforms_to_single_buffer() work for me.

But if I may suggest, I do think it would be helpful for users to be able to fetch waveforms for any unit, and even a particular subset of any unit's spikes, with SortingAnalyzer (i.e., not just the ones chosen randomly by the extension, which is more or less the way users are instructed to use SortingAnalyzer at the moment - and no worries I fully understand what the intention was here, but for me at least, more control over the selection of spikes used to fetch the waveforms would be appreciated).

JarrodDowdall avatar Apr 02 '25 19:04 JarrodDowdall