
Proposal for internal handling of `time_vector`

Open cwindolf opened this issue 1 year ago • 3 comments

Following some discussions with @samuelgarcia and @JoeZiminski , I wanted to open an issue to work out the details of handling time_vectors in a way that won't crash everyone's computers. I'm opening a separate issue from https://github.com/SpikeInterface/spikeinterface/issues/2627, since this will focus on implementation details and will not change the logic.

Summary of what exists:

  • User-facing API: users interact with times through extractor.sample_index_to_time and extractor.time_to_sample_index
    • This means that we have freedom to change the implementation details without users knowing
  • There are two logical cases (sketched in code after this list):
    • time_seconds = t_start + sample_index / sampling_frequency, where t_start=0 by default
    • time_seconds = time_vector[sample_index]
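
To make the two cases concrete, here is a minimal, illustrative sketch of the forward and inverse mappings. The function names and signatures are assumptions for illustration, not the actual extractor API:

```python
import numpy as np

def sample_index_to_time(sample_index, sampling_frequency, t_start=0.0, time_vector=None):
    # Case 1: regular sampling with an optional offset.
    if time_vector is None:
        return t_start + sample_index / sampling_frequency
    # Case 2: explicit per-sample times.
    return time_vector[sample_index]

def time_to_sample_index(time_s, sampling_frequency, t_start=0.0, time_vector=None):
    if time_vector is None:
        return np.round((np.asarray(time_s) - t_start) * sampling_frequency).astype(int)
    # With a time_vector, fall back to a sorted search.
    return np.searchsorted(time_vector, time_s)
```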

Summary of problems:

  • The first case causes no issues.
  • The second causes two issues:
    • For an hour-long recording at ~30 kHz, a time vector takes ~100M samples * (8 bytes / sample) ~= 1 GB of memory. This becomes a problem for parallel processing, where each worker would otherwise hold its own copy.
    • ~~Going from times in seconds to sample indices with a time_vector currently means a np.searchsorted for each sample, which could get expensive for two reasons: time_vectors are long so individual searches can be expensive, and doing lots of them is even worse.~~ edit: this is not really an issue, see discussion below.

I propose that we can still efficiently implement time vectors with two changes:

  • Store them in a SharedMemory when doing multiprocessing (see the sketch after this list)
  • ~~Don't do a searchsorted for every sample input into extractor.sample_index_to_time~~ edit: seems fine.
    • For instance, to speed up individual searches, we could initialize the search at (time_seconds - t_start) * sampling_frequency and look left or right depending on what value we find there.
    • To speed up multiple searches, if the array of times we're searching for is also sorted, then we could use a merge-style pass if that seems helpful.
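
A minimal sketch of the shared-memory idea, using the standard library's multiprocessing.shared_memory directly; the helper names are illustrative, not SpikeInterface's actual job tools:

```python
import numpy as np
from multiprocessing import shared_memory

def publish_time_vector(time_vector):
    # Copy the time vector into a shared block once, in the parent process.
    shm = shared_memory.SharedMemory(create=True, size=time_vector.nbytes)
    view = np.ndarray(time_vector.shape, dtype=time_vector.dtype, buffer=shm.buf)
    view[:] = time_vector
    # Workers only need the name, shape and dtype to attach.
    return shm, (shm.name, time_vector.shape, time_vector.dtype)

def attach_time_vector(name, shape, dtype):
    # In a worker: map the existing block instead of copying ~1 GB.
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)
```

The parent would keep the SharedMemory handle alive for the duration of the job and close()/unlink() it afterwards.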

I think this could be addressed by a PR that doesn't affect any other code, and would allow time_vectors in the motion correction interpolation.

cwindolf avatar Jun 14 '24 15:06 cwindolf

A PR here could try to fix https://github.com/SpikeInterface/spikeinterface/issues/2515 as well

cwindolf avatar Jun 14 '24 15:06 cwindolf

Question: In what case do we need to call time_to_sample_index repeatedly?

My prior is that np.searchsorted is not going to be the bottleneck for anything reasonable if we accept vectors in the corresponding functions:

[benchmark screenshot not reproduced here]

So I think the main concern is going to be memory, as you were mentioning.

Can you remind me how the time vector is used in motion interpolation?

h-mayorquin avatar Jun 14 '24 19:06 h-mayorquin

Good point! Seems like the search is not going to be a bottleneck. Even a pretty aggressive version of your benchmark is fast:

```python
In [11]: time_vector = np.arange(600_000_000.0) / 30_000

In [12]: search_times = np.linspace(500_000_000.0 / 30_000, 500_090_000.0 / 30_000, num=300)

In [13]: %%timeit
    ...: np.searchsorted(time_vector, search_times)
    ...:
90.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

In motion interpolation, sample_index_to_time() is called to determine each sample's time in seconds, and these times are used to look up the drift for each sample. time_to_sample_index(), which would be the slow one -- if that operation were even slow, which as we can see it is not -- is never used. So @samuelgarcia maybe there is no performance issue with having a time_vector in interpolation?
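
For context, a hedged sketch of that usage pattern; the function and its arguments are illustrative rather than the actual interpolation code, and it only assumes sample_index_to_time accepts an array of indices:

```python
import numpy as np

def drift_at_samples(recording, start_frame, end_frame, motion_times_s, motion_displacement_um):
    # Forward mapping only: sample index -> time in seconds.
    sample_indices = np.arange(start_frame, end_frame)
    times_s = recording.sample_index_to_time(sample_indices)
    # Look up the estimated drift at each sample's time.
    return np.interp(times_s, motion_times_s, motion_displacement_um)
```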

cwindolf avatar Jun 14 '24 20:06 cwindolf