spikeinterface icon indicating copy to clipboard operation
spikeinterface copied to clipboard

Proposal for internal handling of `time_vector`

Open cwindolf opened this issue 8 months ago • 3 comments

Following some discussions with @samuelgarcia and @JoeZiminski , I wanted to open an issue to work out the details of handling time_vectors in a way that won't crash everyone's computers. I'm opening a separate issue from https://github.com/SpikeInterface/spikeinterface/issues/2627, since this will focus on implementation details and will not change the logic.

Summary of what exists:

  • User-facing API: users interact with times through extractor.sample_index_to_time and extractor.time_to_sample_index
    • This means that we have freedom to change the implementation details without users knowing
  • There are two logical cases:
    • time_seconds = t_start + sample_index / sampling_frequency, where t_start=0 by default
    • time_seconds = time_vector[sample_index]

Summary of problems:

  • The first case causes no issues.
  • The second causes two issues:
    • For an hour long recording, a time vector takes ~100M samples * (8 bytes / sample) ~= 1GB of memory. This becomes a problem for parallel processing.
    • ~~Going from times in seconds to sample indices with a time_vector currently means a np.searchsorted for each sample, which could get expensive for two reasons: time_vectors are long so individual searches can be expensive, and doing lots of them is even worse.~~ edit: this is not really an issue, see discussion below.

I propose that we can still efficiently implement time vectors with two changes:

  • Store them in a SharedMemory when doing multiprocessing
  • ~~Don't do a searchsorted for every sample input into extractor.sample_index_to_time~~ edit: seems fine.
    • For instance, to speed up individual searches, we could initialize the search at (time_seconds - t_start) * sampling_frequency and look left or right depending on what value we find there.
    • To speed up multiple searches, if the array of times we're searching for is also sorted than we can use a merge or something if it seems helpful.

I think this could be addressed by a PR that doesn't affect any other code, and would allow time_vectors in the motion correction interpolation.

cwindolf avatar Jun 14 '24 15:06 cwindolf