spikeinterface icon indicating copy to clipboard operation
spikeinterface copied to clipboard

Proposal for handling the user interaction with time

Open h-mayorquin opened this issue 1 year ago • 12 comments

From today's meeting.

Context:

Spikeinterface has two internal representation time: cheap: sampling_frequency and t_start expensive: time_vector

Fine control of the time

Right now t_start is a public attribute at the RecordingSegment level and it is harder to access than maybe it needs be. We need a more comfortable setter. Plus, the setter should handle the fact that the cheap and expensive representations are mutually exclusive.

  • A proposal implementation that overloads the current set_times is in #3117.
  • @alejoe91 proposal, set_times_info as a setter that handles the above. I think this is a good idea because it reflects the other method we have get_times_kwargs and it is a nice symmetry from getters to setters. I am +1 on that

End user API

My proposal for how end users should interact with time:

  1. set_times always sets the the time vector set_times and this is the final source of truth after setting.
  2. I propose adding a second function set_starting_time or shift_starting_time (do not know what's better) that is agnostic to the internal implementation. If there is a time vector it shifts so time_vector[0] = start_time and if there is a t_start then it just shifts or changes that values.

Setting the correct times and shifting the current times are natural concepts that don't rely on internal implementation details and the functions above don't leak internal details to users.

Some extra notes:

  1. This enables the workflow of shifting or doing whatever you want with the time_vector before setting it. So it does not get in the way of @JoeZiminski workflow (set_times is still set_times).
  2. @alejoe91 mention that you could use the workflow of get_times() , modify the vector and finally use set_times again. I think this is good and very close to what I want but has the drawback that you always end with the expensive representation internally which I think we can avoid. Plus, right now this workflow throws a warning
  3. @samuelgarcia mentioned that the same mechanism should be implemented for the sorting. Maybe using the registered recording method is an overkill.

In brief, I think we should create an API that does not leak the internal implementation details of how spikeinterface represents time. We might decide down the line to change the internal representation of time and this should not affect the API. Thus, I think those functions and their docstrings should not make references or rely on understanding of t_start and time_vector.

h-mayorquin avatar Jul 05 '24 17:07 h-mayorquin

Shitting and also compressing times! For instance if you want to synchronize recording with distinct source clock (like 2 neuropixel probes) on long recording we need also to modify the sampling_freq to compensate for for the relative clock drift.

samuelgarcia avatar Jul 05 '24 18:07 samuelgarcia

Shitting

Not sure what you mean with this! Ahahahha

alejoe91 avatar Jul 06 '24 07:07 alejoe91

Thanks for this @h-mayorquin I really like the emphasis on the intuitive API and not leaking SpikeInterface's internal representation to the user. Like you said, as far as the user is concerned, the only concept that needs to be held in mind is that the recording has a time vector associated with it, you can set it with set_times or get it with get_times().

The only thing I'm not sure about is a second function for shifting the values set in the first one. For me it is easier to conceptualise just 'set the desired times with set_times()' rather than having to set times and then shift with a second function. It also introduces some obscured dependencies between internal states (t_start and time_vector) that might lead to confusing behaviour in some situations. So my only suggestion might be to have:

  1. set_times() as is now with get_times(). The main thing the user needs to understand is there is a time vector associated with each recording and, if the default is not used, this can be explicitly set.
  2. A second convenience function to set the time vector only from the start time. Maybe something like set_times_from_start_time(start_time=...) (under the hood this still uses t_start as is now but as far as the user is concerned they just set a time vector).

JoeZiminski avatar Jul 08 '24 13:07 JoeZiminski

Hey @h-mayorquin what do you make of the above comment?

JoeZiminski avatar Oct 11 '24 10:10 JoeZiminski

Hey, sorry for not answering this before. Somehow it escaped me.

The way I think:

  1. set_times and get_times are the natural concepts as you mention. They are clear conceptually and straightforward. The only drawback is this: if you need to shift the times then you need to create another large timestamps vector in memory so you can set it. To avoid having the user deal with this memory issue I am proposing a second method for shifting that should be less memory expensive as it does not require the user to create another timestamps vector (hopefully, we can also handle this internally as a lazy operation but that should not matter for the user).

  2. I agree that introducing a second method opens the space for confusion. Are you thinking in something specific? It would be good to enumerate these pitfalls.

The only thing I'm not sure about is a second function for shifting the values set in the first one. For me it is easier to conceptualise just 'set the desired times with set_times()' rather than having to set times and then shift with a second function. It also introduces some obscured dependencies between internal states (t_start and time_vector)

But you don't need to do this. If you have the right timestamps (with the right timestamps[0] value) you just set them and then you are done. No need to concern yourself with the second function. Two following two situations should illustrate the role that I am envisioning for the second function:

  • You got a recording with timestamps already loaded and just want to shift your timestamps for whatever reason (This avoids having to create an entire new timestamps vector and then shift it as mentioned above).
  • You already set the timestamps, but you want to do an analysis that needs your timestamps to be shifted. (Note that in this case, if you still have the original timestamps that you set before then you can just shift the vector and use set_timestamps again for the same effect).

A second convenience function to set the time vector only from the start time. Maybe something like set_times_from_start_time(start_time=...)

So the full signature is: set_times_from_start_time(start_time=start_time, timestamps=timestamps)?

If yes, then I think it has the problem of requiring the users to create a new timestmaps vector in memory just to re-set and I would prefer to stick to having only one method (set_timestamps).

If no, then I don't know how is it different from my proposal, perhaps is a better name?

I am on the fence about the second method, maybe it is better to pass the memory concern to the users and stick with only one method for maximum clarity. I am not sure.

h-mayorquin avatar Oct 11 '24 16:10 h-mayorquin

Thanks @h-mayorquin I see the utility of the shifting function now, I think before I was coming from my own imagined use-case which was not the same. I think these two proposed methods are serving two difference purposes:

  1. Function to shift already set times, avoiding something like recording.set_times(recording.get_times() + time_offset).
  2. Function to create a time vector from start time and sampling rate only, to avoid something like recording.set_times(np.arange(n_samples) * 1/ sampling_rate + t_start). This is actually just set_t_start() (for semantics renamed set_times_from_start_time() above to avoid introducing the user to a distinctt_start concept).

As you say it makes sense to avoid creating and holding onto the time array wherever possible (e.g. int16 samples, 30khZ, 1 hour recording = 0.216GB people are starting on some very long 10-20h+ recordings).

As above, in the case time vector is already set, this could be achieved with a shifting function. In the case of originally creating an (evenly spaced) time array, this could be the second function above. From the user perspective, the are only ever receiving a representation of times from get_times() which will always return a time vector.

In terms of behind-the-scenes representation, I guess it makes sense to always store the time data as t_start + the sampling rate, except when the time array is unevenly sampled. Do we add some check on user-input set_times() such that, if they are evenly sampled, they are converted to t_start and sampling rate? e.g.

def set_times(time_vector):
    if np.unique(np.diff(time_vector)).size == 1:
        self.t_start = time_vector[0]
        self.time_vector = None
    else:
        self.time_vector = time_vector
        self.t_start = None

and the signatures for the two proposed functions above could be like:

def shift_starting_time(time_offset):
    if self.time_vector:
        self.time_vector += time_offset
    elif self.t_start:
        self.t_start += time_offset
def set_times_with_start_time(t_start).   # maybe 'with' instead of 'from'
    self.time_vector = None
    self.t_start = t_start

JoeZiminski avatar Oct 15 '24 13:10 JoeZiminski

Thanks for keeping up the discussion guys!

@JoeZiminski why not just shift_t_start and set_t_start for the two functions? As pre your proposed implementation, the set_times_with_start_time(t_start) is in fact only setting the t_startand I think this is a clear enough behavior. Adding additional semantics might end up being more confusing than helpful. Probably the second function should raise a warning if the time vector has been previously set.

alejoe91 avatar Oct 15 '24 13:10 alejoe91

Thanks @alejoe91, I think the idea was in part to abstract away the implementation details, so the only concept a user has to keep in mind is that of a time_vector. The proposed sematics describe all operations with reference to the time array (i.e shifting it, setting it with only the start value) and try not to introduce explicitly the t_start.

For example, one situation where mixing these concepts might be confusing is if you have an unevely spaced time vector set on your recording. shift_t_start is okay (although shift_time_vector is imo better for the above reason). But what does set_t_start do? I would expect it to shift the existing time array to start at the specified t_start, whereas it would generate a new, evenly spaced time array starting at t_start. So I think there needs to be something in the semantics of the name that make very clear that the operation is generating a new time array and setting it on the recording.

That being said, find set_times_with_start_time quote verbose and don't like it much either. But I think it is nice to work towards the goal of having the user only have to keep in mind the time_vector construct.

JoeZiminski avatar Oct 15 '24 14:10 JoeZiminski

Hey @h-mayorquin @alejoe91 thanks for the discussion. Would anyone be opposed to me opening a PR that:

  • introduces a convenience function for shifting time
  • introduces a convenience function for creating time array from start value only
  • if set_times is passed a vector with evenly sampled timestamps, convert this under the hood to use _start.

Of course these will be provisional and open to a lot of changes, but just want to make sure this would be a good starting point. Once this is hammered out #3072 can be redrafted to get a feel how the times API is feeling.

JoeZiminski avatar Oct 22 '24 15:10 JoeZiminski

introduces a convenience function for creating time array from start value only

Can you discuss the signature? it can't be number of samples because that if fixed by the recording length. is it sampling rate and starting time? in that case it would be a wrapper for the shifting/setting t_start plus setting sampling rate? How are you thinking?

if set_times is passed a vector with evenly sampled timestamps, convert this under the hood to use _start.

I have been thinking for a while about this and there is a couple of gotchas. I would like to see the PR to see how you are thinking but I was hoping we could discuss this whole thing on a meeting again before moving forward here. That said, feel free to move this forward and we can discuss it on the PR.

h-mayorquin avatar Oct 22 '24 17:10 h-mayorquin

@h-mayorquin I just realised I've been operating under a significant misunderstanding of how the existing API works 🤦‍♂️ For some reason I thought when you load a recording, it has no time set whatsoever, so get_times() wouldn't return anything and you'd have to first set up the times yourself (except for a few cases where neo would handle it from metadata). I have no idea why I thought this 🤦‍♂️

Under this (wrong) understanding, the idea was to initially set the times from a start time, rather than have to make and set a full time vector. But under the actual present API behaviour, of course this makes no sense, from the user's perspective there will already be a time attached to the recording, available through get_times(). If they have a nonzero start time with regularly spaced samples they can just shift this time vector.

This also means there is almost never any cause for someone with a regularly spaced time array to use set_times() and attach a full time vector, they can always just shift the existing recording. So maybe the memory issues are not really much of a consideration. Either the times are regularly spaced and they shift the original time vector, or they are irregularly spaced and they need the full vector anyway. The only case would be if they set a irregularly spaced time vector then later want to set a regularly spaced time vector, but this is a real edge case.

So my apologies, after all this discussion I agree completely with your first post 😅 and just propose a PR to add this shifting function. Then the docs can cover some cases:

  1. Have a regularly spaced time timeseries - either it is loaded by neo or you can manually shift the default start time (0) if you want.
  2. Have an irregularly spaced timeseries - either it is loaded by neo (?) or you must set manually with set_times(). We can mention in the docs to users they should only really use this function if they are setting an irregularly spaced timeseries.

JoeZiminski avatar Oct 22 '24 18:10 JoeZiminski

Have a regularly spaced time timeseries - either it is loaded by neo or you can manually shift the default start time (0) if you want.

Right, although I am still unsure about the naming: Knuth strikes again. We should probably stop bikesheeding at some point and just stick to whatever you decide on the PR x D

Have an irregularly spaced timeseries - either it is loaded by neo (?) or you must set manually with set_times(). We can mention in the docs to users they should only really use this function if they are setting an irregularly spaced timeseries.

Right, we use it this way in neuroconv. Intan is an example of a format where people could load irregular timestamps if their recording has frames that are skipped.

h-mayorquin avatar Oct 22 '24 19:10 h-mayorquin