Proposal for internal handling of `time_vector`
Following some discussions with @samuelgarcia and @JoeZiminski, I wanted to open an issue to work out the details of handling time_vectors in a way that won't crash everyone's computers. I'm opening a separate issue from https://github.com/SpikeInterface/spikeinterface/issues/2627, since this one will focus on implementation details and will not change the logic.
Summary of what exists:
- User-facing API: users interact with times through `extractor.sample_index_to_time` and `extractor.time_to_sample_index`
- This means that we have freedom to change the implementation details without users knowing
- There are two logical cases (see the sketch after this list):
  - `time_seconds = t_start + sample_index / sampling_frequency`, where `t_start = 0` by default
  - `time_seconds = time_vector[sample_index]`
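For reference, a minimal sketch of the two cases. The function signatures here are illustrative, not the actual extractor internals:

```python
import numpy as np

def sample_index_to_time(sample_index, sampling_frequency, t_start=0.0, time_vector=None):
    """Illustrative sketch of the two logical cases (not the real extractor code)."""
    if time_vector is None:
        # Case 1: times are fully described by t_start and the sampling frequency.
        return t_start + sample_index / sampling_frequency
    # Case 2: times come from an explicit per-sample vector.
    return time_vector[sample_index]

def time_to_sample_index(time_s, sampling_frequency, t_start=0.0, time_vector=None):
    """Inverse mapping; with a time_vector this requires a search."""
    if time_vector is None:
        return np.round((time_s - t_start) * sampling_frequency).astype(int)
    return np.searchsorted(time_vector, time_s)
```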
Summary of problems:
- The first case causes no issues.
- The second causes two issues:
- For an hour-long recording, a time vector takes ~100M samples * (8 bytes / sample) ≈ 1 GB of memory (see the back-of-the-envelope calculation after this list). This becomes a problem for parallel processing.
- ~~Going from times in seconds to sample indices with a time_vector currently means a `np.searchsorted` for each sample, which could get expensive for two reasons: time_vectors are long so individual searches can be expensive, and doing lots of them is even worse.~~ edit: this is not really an issue, see discussion below.
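For concreteness, a back-of-the-envelope version of the memory estimate above (assuming a 30 kHz recording and float64 timestamps):

```python
# Memory footprint of a per-sample time vector for a one-hour recording.
sampling_frequency = 30_000.0
duration_s = 3_600.0  # one hour

n_samples = int(duration_s * sampling_frequency)  # 108_000_000 samples
bytes_per_sample = 8                              # float64 timestamps
print(f"{n_samples * bytes_per_sample / 1e9:.2f} GB")  # ~0.86 GB
```

The parallel-processing concern is that, without shared memory, each worker process ends up holding its own copy of an array this size.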
I propose that we can still efficiently implement time vectors with two changes:
- Store them in a SharedMemory when doing multiprocessing (see the sketch after this list)
- ~~Don't do a searchsorted for every sample input into `extractor.time_to_sample_index`~~ edit: seems fine.
  - For instance, to speed up individual searches, we could initialize the search at `(time_seconds - t_start) * sampling_frequency` and look left or right depending on the value we find there.
  - To speed up multiple searches, if the array of times we're searching for is also sorted, then we can use a merge or something similar if it seems helpful.
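A minimal sketch of the SharedMemory idea, independent of spikeinterface's actual job machinery (the function names here are hypothetical): the parent process copies the time vector into a `multiprocessing.shared_memory` block once, and each worker wraps the same buffer in a NumPy array instead of pickling its own ~1 GB copy.

```python
import numpy as np
from multiprocessing import shared_memory

def share_time_vector(time_vector):
    """Copy time_vector into shared memory once, in the parent process."""
    shm = shared_memory.SharedMemory(create=True, size=time_vector.nbytes)
    shared = np.ndarray(time_vector.shape, dtype=time_vector.dtype, buffer=shm.buf)
    shared[:] = time_vector
    # Workers only need (name, shape, dtype) to attach, which is cheap to pickle.
    return shm, (shm.name, time_vector.shape, time_vector.dtype)

def attach_time_vector(descriptor):
    """In a worker process: view the shared buffer without copying it."""
    name, shape, dtype = descriptor
    shm = shared_memory.SharedMemory(name=name)
    time_vector = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return shm, time_vector  # keep shm alive as long as time_vector is in use
```

Workers would call `shm.close()` when done, and the parent would call `shm.close()` and `shm.unlink()` after processing finishes.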
I think this could be addressed by a PR that doesn't affect any other code, and would allow time_vectors in the motion correction interpolation.
A PR here could try to fix https://github.com/SpikeInterface/spikeinterface/issues/2515 as well.
Question:
In what case do we need to call time_to_sample_index repeatedly?
My prior is that np.searchsorted is not going to be the bottleneck for anything reasonable if we accept vectors in the corresponding functions:
So I think the main concern is going to be memory, as you were mentioning.
Can you remind me how the time vector is used in motion interpolation?
Good point! Seems like the search is not going to be a bottleneck. Even a pretty aggressive version of your benchmark is fast:
```
In [11]: time_vector = np.arange(600_000_000.0) / 30_000

In [12]: search_times = np.linspace(500_000_000.0 / 30_000, 500_090_000.0 / 30_000, num=300)

In [13]: %%timeit
    ...: np.searchsorted(time_vector, search_times)
    ...:
90.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
In motion interpolation, sample_index_to_time() is called to determine each sample's time in seconds, and these times are used to look up the drift for each sample. time_to_sample_index() is not used, which would be the slow one -- if that operation were even slow, which, as we can see, it is not. So @samuelgarcia maybe there is no performance issue with having a time_vector in interpolation?
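To make that usage pattern concrete, a hedged sketch of the relevant step (the variable names and the use of np.interp are illustrative, not the actual motion-interpolation code): only the forward mapping is needed, so the time_vector is indexed, never searched.

```python
import numpy as np

def drift_at_samples(sample_indices, time_vector, motion_times, motion_displacement):
    # Forward mapping: sample index -> time in seconds is a plain array lookup
    # when a time_vector exists (no searchsorted involved).
    times_s = time_vector[sample_indices]
    # Look up the estimated drift at each sample's time, e.g. by interpolating
    # a displacement-vs-time curve (illustrative; not the actual motion code).
    return np.interp(times_s, motion_times, motion_displacement)
```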