Sync timestamps
As requested, I moved the pull request to the new repo structure according to https://github.com/sccn/xdf/pull/28
As mentioned there:
Allows resampling (using interpolation for numeric values and shifting for strings) of all streams so that their timestamps are sample-wise in sync with the fastest stream. Example: `streams, fileheader = load_xdf(filename, sync_timestamps='linear')`. This might help address #24 and mne-tools/mne-python#5180
I had to manually copy the code, as the history is broken, and I hope I did not miss anything. Tests ran fine.

On a side note, I added the feature before the ordering/recasting currently discussed in https://github.com/sccn/xdf/pull/42. This implementation should therefore survive if a decision is made for the breaking change, but I'll have to adapt the tests.
Can you make a trivial change and commit it? I'm hoping to trigger a CI build and that it will run the test suite. Maybe add a newline at the end of test_limit_on_synced_stamps.py.
Seems to work!
The test in CI gave a warning: `pyxdf.py:230: DeprecationWarning: invalid escape sequence \R`. I also get this when I run pytest the first time after a change. If I run the tests a second time, and from then on, the warning is gone and no longer pops up. The line 230 that causes this warning is the final `"""` of the load_xdf docstring. I haven't found a way to get rid of this warning. Might be a bug in pytest?
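For what it's worth, this kind of warning usually points at an unescaped backslash sequence somewhere inside the string itself; pytest just attributes it to the line where the string ends. A raw docstring would be one way to silence it; a minimal sketch (the Windows path in the docstring is made up for illustration):

```python
def load_xdf(filename):
    r"""Load an XDF file.

    The r prefix makes this a raw docstring, so a backslash sequence
    such as C:\Recordings is no longer treated as an escape sequence,
    which is what triggers "DeprecationWarning: invalid escape
    sequence \R".
    """
```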
I guess the next action item is to rebase this commit on the changes in master.
What's the status of this PR? I'm not even sure what this adds - is this implementing resampling of all streams to a common sampling frequency? @agricolab maybe you could update this PR with a short description in your top comment.
Updated the top comment with a direct quote from the original pull request.
Regarding your question: yes, it syncs streams by interpolation/shifting so that samples match. Regarding the status: this is relatively old now and therefore does not (yet) account for the recent feature of subselecting streams from an XDF file.
Nice, this feature is important. I can rebase if you want.
That'd be great. I am currently quite busy with work and probably won't find the opportunity to work on this before January.
Alright, I rebased and fixed some PEP8-related stuff. Maybe I can come back to this before January.
@agricolab I assume you don't have time to finalize this PR? In addition to rebasing, I think what's currently missing is support for stream selection (which has been added to main). Also, it would be nice if the frequency that is used for resampling could be defined manually. By default, it could still be the highest sampling frequency among all (selected) streams, but this is not always practical if there's e.g. an audio stream, which could also be downsampled.
I can make time in the next few weeks. Considering it's been apparently five years in the making, it really is time to resolve this 😅
😄 This would be really great! If this is available in PyXDF, downstream projects don't have to roll their own implementations.
Speaking of which, we've been discussing the problem of resampling here: https://github.com/cbrnr/mnelab/issues/385
Specifically, we were thinking about a solution when there are long gaps between otherwise relatively regularly spaced samples. @DominiqueMakowski implemented linear interpolation (using pandas), which is certainly better than assuming that samples are regularly spaced, but I am not sure if interpolating such long periods is the best solution. I think that replacing these segments with NaNs would be a better option.
WDYT? Is this covered by your changes, and if so, how do you deal with large gaps?
IIRC there was no consideration for large gaps. I'll make up my mind, but probably it's simplest to let the user decide. Which means even more arguments to the function 🥴
In the end, I'd say it depends on the source of the data, or rather its complexity/spectrum/non-stationarity and the corresponding timescale.
I updated everything to integrate with the most recent version. I also updated the tests and added a smoke test using the example files (i.e. minimal.xdf). Note that I used a pytest fixture for that (see the sketch after this list). TODOs are:
- [ ] Handle long gaps between otherwise relatively regularly spaced samples
- [x] Frequency that is used for resampling can be defined manually
- [ ] Handle numpy deprecation warning for creating an array from a list of strings (in https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L1004)
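For reference, a minimal sketch of what such a fixture-based smoke test might look like (the file path and test names are illustrative, not the PR's actual test code):

```python
import pytest
from pyxdf import load_xdf


@pytest.fixture
def minimal_file():
    # Hypothetical location of the bundled example file
    return "example-files/minimal.xdf"


def test_sync_smoke(minimal_file):
    # Smoke test: syncing timestamps on a real file should not raise
    streams, fileheader = load_xdf(minimal_file, sync_timestamps="linear")
    assert len(streams) > 0
```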
Small note:

> Handle long gaps between otherwise relatively regularly spaced samples

I would not apply special treatment only when long gaps are detected and otherwise assume that the signal is regularly sampled. In fact, I would never assume that it is (from experience, it often isn't, and a few milliseconds are vital in some applications). Also, LSL returns a timestamp for every sample, so it seems a shame not to use them and lose precision.
This PR is about creating regularly sampled time series via resampling. So I think large gaps need to be treated somehow in order to avoid interpolating a large gap and instead insert e.g. NaNs.
@agricolab I was wondering if this PR can also deal with just a single stream, i.e. resample that stream to a regular sampling frequency (and deal with larger gaps). Or is this already possible with the current release?
With use_samplingrate, one should now be able to resample a single stream (or multiple streams). This comes naturally, as the stream with the highest sampling rate needs to be resampled anyway if a different sampling rate is used, so there's that.
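For illustration, a hypothetical call on this branch (the exact signature may still change before merging):

```python
from pyxdf import load_xdf

# Resample all selected streams, including the fastest one, to 250 Hz
streams, fileheader = load_xdf(
    "recording.xdf",
    sync_timestamps="linear",
    use_samplingrate=250,
)
```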
Regarding the large gaps in the recording, I am not sure how this will interfere with or replicate what's already being handled by the jitter..., clock_reset..., and winsor_threshold parameters.
If I understand them correctly, they already try to manage that; I'll read that part of pyxdf in more detail over the next few days. But I wouldn't like to change the behavior of these parameters for resyncing/resampling.
E.g. https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L692 already looks for breaks, doesn't it?
So we would change whether the individual segments are simply kept (as seems to be the case right now, see https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L705) or whether the space between segments is filled with a single NaN (which could be propagated to the _sync function call for handling).
I understand your request, but I'm trying not to fall into feature creep that'll result in a rewrite of pyxdf.
You absolutely don't have to do this if you don't want to. I was just thinking that from a user perspective, the most useful representation of a single regularly sampled stream would be a resampled version that's based on all timestamps. This makes sure that the sampling frequency is indeed e.g. 500 Hz, as opposed to 500.03231 Hz, as is currently the case if you just take the timestamps as is. np.interp() does exactly that. An additional step would then be to handle larger gaps instead of interpolating them; this can be done as a post-processing step.
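A minimal, self-contained sketch of that idea (synthetic data; a nominal rate of 500 Hz is assumed):

```python
import numpy as np

# Jittered timestamps with an effective rate of roughly 500.03 Hz
rng = np.random.default_rng(0)
t_old = np.cumsum(rng.normal(1 / 500.03, 1e-5, size=1000))
y_old = np.sin(2 * np.pi * 5 * t_old)

# Interpolate onto an exactly equidistant 500 Hz grid
t_new = np.arange(t_old[0], t_old[-1], 1 / 500.0)
y_new = np.interp(t_new, t_old, y_old)
```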
Currently, regardless of whether one uses the fastest stream for syncing (i.e. its effective srate) or sets the srate manually, the time_stamps are newly created to be equidistant. See https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L985-L998
There was a reason why I did it in such a complicated way for the highest effective srate, but I can't recall it right now after so many years. Maybe I was just being stupid...
I agree that how to handle large gaps is worthy of more discussion. I didn't consider it an issue because of the other routines already there which do normalize the timestamps. I'll think a little bit about how and whether we can integrate it easily.
Ok, I started implementing.
Segments will be detected by the _jitter... helper function, and I'll propagate information about the start and stop of each stream's segments to the _sync functions.
One issue is that there might be different segments for different streams, i.e. we can't simply use one stream to sync the rest against. I'll accommodate that. It'll be a little tricky because the time between a stream's segments is not in sync with the stream's sampling rate.
Additionally, limiting to overlapping timestamps will behave differently: it used to return all overlapping samples from earliest to latest, but it will now only return overlapping segments available in all streams.
> One issue is that there might be different segments for different streams, i.e. we can't simply use one stream to sync the rest against. I'll accommodate that. It'll be a little tricky because the time between a stream's segments is not in sync with the stream's sampling rate.
I'm trying to think about this in simpler terms, starting from the case where you only have one stream. In other words, you handle each stream separately by resampling (interpolating) to evenly spaced samples, even between segments. If you do that for every stream independently, you end up with samples at the exact same time points, meaning they should be synchronized. Periods between segments could then be filled with NaN as a post-processing step.
Or am I simplifying too much here?
Consider a stream with an Fs of 10 Hz. One segment might start at 1.1 s and go to 2.1 s, while the second segment starts at 2.75 s and goes to 4.65 s. The start and stop of each segment are not aligned to the effective sampling rate within each segment. Which segment should be shifted, and how? That's the question.
If you interpolate to 10 Hz and start e.g. at second 1.1, then you have regularly spaced samples for the entire time period, including the segments and the periods in between.
Here's a little example to illustrate what I mean:
```python
import matplotlib.pyplot as plt
import numpy as np

# x ... original time values
# y ... original signal values
x = np.linspace(0, 2 * np.pi, 10)
x = x[[0, 3, 4, 6, 7, 8, 9]]
y = np.sin(x)


def interp(xvals, x, y, limit=None):
    yinterp = np.interp(xvals, x, y)
    if limit is None:
        return yinterp
    for i, interval in enumerate(np.diff(x) > limit):
        if interval:  # difference > limit, so fill with np.nan
            yinterp = np.where((xvals < x[i + 1]) & (xvals > x[i]), np.nan, yinterp)
    return yinterp


# xvals ... new (resampled) time values
# yinterp ... new (resampled) signal values
xvals = np.linspace(0, 2 * np.pi, 50)

# limit=2 means that gaps > 2 (in the original units of x) will not be
# interpolated, but filled with np.nan
yinterp = interp(xvals, x, y, limit=2)

plt.plot(x, y, "o")
plt.plot(xvals, yinterp, "-x")
plt.show()
```
The behavior of _jitter_removal, which is currently used by load_xdf to detect breaks in the timestamps, is to linearize the timestamps within each segment independently of all other segments:
https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L710-L718
That means each segment can have a different effective srate, and segments do not share a common time basis, i.e. samples are not aligned on a common srate. The stream's effective_srate field will be a duration-weighted average of all segments: https://github.com/agricolab/xdf-Python/blob/sync_timestamps/pyxdf/pyxdf.py#L720C1-L728C76
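In other words, a sketch of that computation (not the actual pyxdf code):

```python
import numpy as np

seg_srates = np.array([499.9, 500.1])    # effective srate per segment
seg_durations = np.array([120.0, 60.0])  # segment durations in seconds

# effective_srate is the average weighted by each segment's duration
effective_srate = np.sum(seg_srates * seg_durations) / np.sum(seg_durations)
```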
For a multi-segment stream (which can always be the case), that means we can:

a) start at the earliest available sample, end at the latest available sample, and resync all segments to the average effective srate (or the srate given by the argument) -> all segments will have the same srate and are aligned
b) resample all segments to the average effective srate, but keep them unaligned
c) resample all segments to the average effective srate, and align them to the segment with the longest duration
Optionally, we can fill gaps with NaN.
Now, because the effective srate averaged across segments will be unequal to the srate within each segment (at least due to floating-point accuracy), all segments will need to be resampled to their new srate anyway. That means c) doesn't make much sense.
If we use the stream with the highest effective srate, one could argue to leave that stream (and its segments) untouched and simply sync all other streams to the timestamps given by the fastest stream. That makes sense, as it maintains the original signal fidelity in .time_series by not interpolating. This was the original idea of this PR (while still being naive to the segmentation issue). But if the fastest stream is segmented, we will likely cause loss of data for the other streams, as we would just jump the gaps between the segments of the fastest stream. This is not optimal. It could be circumvented by filling the gaps with NaNs, but then there is the issue of which srate to use: either smooth the change in srate and time basis between segments, which would not affect signal fidelity but would cause havoc for most signal-processing routines, as they can't assume a stationary srate; or sync everything to a common basis, which is simple and straightforward but results in a loss of fidelity.
There is now the option to say that if the user wants to sync with the fastest stream, and this stream has only a single segment, we use that stream's time_stamps and leave its time_series untouched. If there are multiple segments, we sync everything to a common basis from the earliest to the latest sample and fill the fastest stream's gaps with NaNs. I don't like that, because it would result in different behavior depending on the data and cannot be controlled by the user.
So, the simplest default behavior would be to let the user decide whether to use the fastest stream's srate or to supply their own desired srate. Then, all streams (including the fastest) will be resampled to that srate, span from the earliest sample of all streams to the latest sample of all streams, and gaps not supported by data from any segment will be set to NaN.
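For illustration, a minimal sketch of that default behavior (function name and stream layout are assumptions for the example, not the PR's actual code):

```python
import numpy as np


def sync_to_common_grid(streams, fs, max_gap):
    """Resample every stream onto one equidistant grid spanning the
    earliest to the latest sample across all streams; samples outside
    a stream's coverage or inside gaps wider than max_gap (seconds)
    are set to NaN. Only the first channel is handled, for brevity."""
    t0 = min(s["time_stamps"][0] for s in streams)
    t1 = max(s["time_stamps"][-1] for s in streams)
    grid = np.arange(t0, t1, 1 / fs)
    for s in streams:
        t, y = s["time_stamps"], s["time_series"][:, 0]
        yi = np.interp(grid, t, y)
        # NaN outside the stream's coverage and inside large gaps
        yi[(grid < t[0]) | (grid > t[-1])] = np.nan
        for i in np.flatnonzero(np.diff(t) > max_gap):
            yi[(grid > t[i]) & (grid < t[i + 1])] = np.nan
        s["synced"] = (grid, yi)
    return streams
```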
This sounds good to me! I understand your argument that you are reluctant to resample a stream when it is not strictly necessary. However, resampling (if done correctly) should not affect the fidelity of the signals. Just because you are not using the exact samples that came out of the amplifier does not mean that the result is less reliable (or wrong in any other sense).
As a user, I almost always want my data in a 2D array with some constant sampling frequency. So are you saying you would use the original (effective) sampling frequency and just fill gaps with NaN for the fastest stream to avoid resampling?
Another option related to this issue could be to let the user decide if they want (1) the original data (which is what they get now, each stream separately with their samples and associated time values) or (2) if they want a resampled version (i.e., a 2D NumPy array). What you are suggesting is something in between, right? The original values for the fastest stream, and resampled values for the others. I'm not sure this is worth the added complexity.
I would strongly push back against resampling (or doing anything to the signal and timestamps) by default; if anything, this PR should be non-breaking. I would also always return the rawest data (raw samples with raw timestamps), so that people can post-process if they want.
My argument against resampling by default is that it is never a trivial process, even with linear interpolation. Problems of signal aliasing can appear and strongly distort the signal.
I strongly agree with @cbrnr to let the user decide whether they want the original data or a resampled version (in which case I think that resampling to the highest existing frequency is not good enough; you need to at least double the rate in order to avoid signal aliasing).
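To make the aliasing point concrete, here is a hedged sketch contrasting naive interpolation with a filtered downsampler (illustrative only, not a proposal for the pyxdf API):

```python
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 1000, 250
t = np.arange(0, 2, 1 / fs_in)
# The 400 Hz component lies above the 125 Hz Nyquist limit of the target rate
sig = np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)

# Naive linear interpolation lets the 400 Hz component alias into band
t_down = np.arange(0, 2, 1 / fs_out)
naive = np.interp(t_down, t, sig)

# resample_poly applies an anti-aliasing FIR filter before decimating
filtered = resample_poly(sig, up=1, down=fs_in // fs_out)
```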