
Uniform and Non-Uniform Time Axes

Open: kkappler opened this issue 1 year ago

Recently there have been a number of small updates to mth5 addressing timing issues (refs include the Nov/Dec 2024 PR DQ, and Autumn 2023 KK), making sample rate a settable property.

These updates, and some others, have been motivated by both performance and accuracy. The performance issues came from generating the time axis and estimating the sample rate from the timestamps in that axis. Time axes are basically vectors of timestamps that get bound to data in an xarray or dataframe. In an old version of the code, the sample_rate was always computed on the fly by taking the median difference of the timestamps. This turned out to be impractically expensive for high sample rate data, because the sample rate is referred to many times during metadata validation and was being recomputed at each call. Two solutions were implemented:

  • Solution 1 was to make _sample_rate a private attribute initialized to None; the first request calls _compute_sample_rate and stores the value, and subsequent calls to sample_rate return self._sample_rate, skipping the repeated computations.
  • Solution 2 involved baking in the assumption that the data are uniformly sampled and that the time axis is perfectly accurate. This I think happens somewhere in the FC data, and perhaps elsewhere. In this case, the sample rate returned is derived from the difference of the timestamps at positions 1 and 0.
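The two solutions can be sketched roughly as follows (a hypothetical minimal class, not the actual mth5 code; the names `Channel` and `sample_rate_assuming_uniform` are illustrative):

```python
import numpy as np

class Channel:
    def __init__(self, timestamps_ns):
        self._time = np.asarray(timestamps_ns, dtype="int64")  # epoch ns
        self._sample_rate = None  # Solution 1: cache, computed lazily

    def _compute_sample_rate(self):
        # robust but expensive estimate: median timestamp difference
        return 1e9 / np.median(np.diff(self._time))

    @property
    def sample_rate(self):
        if self._sample_rate is None:  # computed once, then reused
            self._sample_rate = self._compute_sample_rate()
        return self._sample_rate

    @property
    def sample_rate_assuming_uniform(self):
        # Solution 2: trust the first interval (timestamps at 1 and 0)
        return 1e9 / (self._time[1] - self._time[0])
```

The cache makes repeated metadata-validation calls cheap, at the cost of staleness if the time axis is later modified; the uniformity shortcut is cheaper still but silently assumes a perfect axis.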

If all data were uniformly sampled and all timestamps were perfectly precise, this would be fine, but some recent updates have been mostly associated with data that are not sampled with an integer-nanosecond time step, for example 3 Hz, 30 Hz, 24000 Hz, etc. (TODO: confirm this is when the idealized sample rate (if integer) has a prime factorization containing a prime number besides 2 or 5 (such as 3, 7, 11, ...), or the sample rate is only expressible as a floating point number with infinitely many digits after the decimal.)
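Regarding the TODO above: for integer rates, a necessary and sufficient check is simply whether 10**9 is divisible by the rate. The prime-factorization condition alone is necessary but not sufficient, because 10**9 = 2**9 * 5**9 carries only nine factors each of 2 and 5; e.g. 1024 Hz = 2**10 contains only 2s, yet its interval is 976562.5 ns. A minimal sketch:

```python
# An integer sample rate yields an integer nanosecond interval exactly
# when the rate divides 10**9 (= 2**9 * 5**9).
def integer_ns_interval(rate_hz):
    return (10**9) % rate_hz == 0

print(integer_ns_interval(3))     # False (prime factor 3)
print(integer_ns_interval(1024))  # False, despite factors of only 2
print(integer_ns_interval(2500))  # True  (interval is 400000 ns)
```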

Why are we having this trouble?

  • If the sample_interval cannot be expressed as an integer number of nanoseconds, there will always be non-uniformity in the time axis as long as we use the nanosecond-limited pd.Timestamp for our time-axis values (and use the time axis as the source of the sample rate).
  • Another way to say this is that the sample rate needs more information than just an integer number of nanoseconds to be completely characterized.
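A quick demonstration of the first bullet, assuming a 3 Hz axis is built from float seconds and stored at pandas' nanosecond resolution:

```python
import numpy as np
import pandas as pd

# Idealized 3 Hz time axis quantized to ns: the per-sample deltas
# cannot all be equal, because 1/3 s is not an integer ns count.
t = pd.to_datetime(np.arange(6) / 3.0, unit="s")
deltas_ns = np.diff(t.asi8)  # integer nanoseconds between samples
print(np.unique(deltas_ns))  # two distinct values near 333333333
```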

Consequences of non-integer-nanosecond-sampled data include hard-to-trace glitches when merging data on the time axis between runs, and the fact that the sample_rate property of a RunTS or ChannelTS is not a true and complete characterization of the time axis.
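The merge glitch can be reproduced with a toy example: two writers quantize the same idealized 3 Hz instants to nanoseconds with different rounding conventions (the specific ns values below are illustrative), and exact-match alignment then silently drops samples:

```python
import pandas as pd

# Same idealized instants (0, 1/3 s, 2/3 s), quantized two ways:
t_round = pd.to_datetime([0, 333333333, 666666667], unit="ns")  # rounded
t_floor = pd.to_datetime([0, 333333333, 666666666], unit="ns")  # truncated

# Exact-timestamp alignment keeps only 2 of the 3 samples:
print(len(t_round.intersection(t_floor)))
```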

Proposed Solution 1:

  • Create an abstract base class called TimeAxis.
  • Two children of this class are UniformTimeAxis and NonUniformTimeAxis
  • This does not solve the problem completely, because data sampled at 3 Hz are uniformly sampled, but they cannot be represented by a uniform delta-t if we are using ns-resolution timestamps. Thus, in the context of pd.Timestamp as the container for timestamps, they are non-uniform for practical purposes.
  • By adding an attribute called, say, minimum_resolution_of_time_stamp = 1e-9, a quick test can be done to tell if a given idealized sample rate will result in a uniform or non-uniform time axis.
  • The TimeAxis class can have the following methods:
      • resolution: this property tells how fine the sampling can be tracked
      • start_time (or start_sample), and/or end_time (or end_sample)
      • idealized_sample_rate (floating point resolution)
      • idealized_sample_interval
      • to_array or to_axis, which forms the actual vector of values that goes into the numpy array.
  • The main idea is that all the logic for handling the usual attributes (sample rate and sample interval) can live in these methods, and we can quickly identify, by inspecting an object, whether it is uniform or non-uniform.
  • Users would probably be strongly encouraged to resample data to a UniformTimeAxis before archiving (or at least before processing).
  • sample_rate and sample_interval could then be returned at a selected resolution for any instance. This does not solve everything, but it pushes the instances into two cases, Uniform and NonUniform, where Uniform can be used for all time series whose sample interval is an integer multiple of the base resolution.
  • The TimeAxis classes would become mt_metadata objects and would be embedded in RunTS, ChannelTS, Spectrogram or any other time-series-like data container.
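A hypothetical sketch of the hierarchy described above; the attribute and method names follow the bullets, but the implementation details (epoch-seconds start_time, the tolerance test in is_uniform) are assumptions, not an actual mt_metadata API:

```python
from abc import ABC, abstractmethod
import numpy as np

class TimeAxis(ABC):
    # how fine the sampling can be tracked (pd.Timestamp ns limit)
    minimum_resolution_of_time_stamp = 1e-9  # seconds

    def __init__(self, start_time, n_samples, idealized_sample_rate):
        self.start_time = start_time  # epoch seconds, for simplicity
        self.n_samples = n_samples
        self.idealized_sample_rate = idealized_sample_rate

    @property
    def idealized_sample_interval(self):
        return 1.0 / self.idealized_sample_rate

    @property
    def is_uniform(self):
        # quick test: uniform iff the interval is (numerically) an
        # integer multiple of the minimum timestamp resolution
        ticks = self.idealized_sample_interval / self.minimum_resolution_of_time_stamp
        return abs(ticks - round(ticks)) < 1e-6

    @abstractmethod
    def to_array(self):
        """Form the actual vector of timestamps bound to the data."""

class UniformTimeAxis(TimeAxis):
    def to_array(self):
        step_ns = round(self.idealized_sample_interval * 1e9)
        start_ns = round(self.start_time * 1e9)
        return (start_ns + step_ns * np.arange(self.n_samples)).astype("datetime64[ns]")

class NonUniformTimeAxis(TimeAxis):
    def to_array(self):
        # quantize each idealized instant independently; deltas jitter by 1 ns
        t = self.start_time + np.arange(self.n_samples) * self.idealized_sample_interval
        return np.round(t * 1e9).astype("int64").astype("datetime64[ns]")
```

For example, a 2500 Hz axis passes the is_uniform test (400000 ns per sample), while a 3 Hz axis fails it and would be instantiated as NonUniformTimeAxis.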

Proposed Solution 2:

  • From rambling through Solution 1, it seems that the main issue is the base resolution. If we switch to the attotime package for timestamp handling, we get yoctosecond resolution and these issues will possibly go away (at least for MT).
  • That said, it is still desirable to support non-uniformly sampled data, as this is the most general case, and any time series can be represented as a zipped pair of vectors, time and data.
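As a toy illustration of the resolution point, using exact rationals from the standard library as a stand-in for a higher-resolution timestamp type (attotime itself is not shown here), a 3 Hz axis becomes perfectly uniform once the container can represent 1/3 s exactly:

```python
from fractions import Fraction

# Exact rational arithmetic: no ns quantization, so the 3 Hz axis
# has a single, exact interval between every pair of samples.
dt = Fraction(1, 3)  # exactly 1/3 second
axis = [i * dt for i in range(6)]
deltas = {b - a for a, b in zip(axis, axis[1:])}
print(deltas)  # {Fraction(1, 3)}
```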

Proposed Solution 3:

  • When creating the MTH5 archives, upsample the data so that there is an integer number of nanoseconds between samples. This generally supports sample rates up to 1 GHz, which is higher than we should ever need in MT. Add a note to the documentation about this, and add a warning when data are archived at a native, non-integer-nanosecond sample interval -- these awkward sample rates are arguably OK for archiving; it is only when processing gets involved that they become problematic. So the warning can recommend an upsampled frequency to start the processing flow from.
  • It would be helpful to prepare a table of "friendly" sample rates to upsample to. Otherwise, we can wind up with some janky sample rates that are themselves not cleanly expressible. For example, 2400 Hz corresponds to 416666.6666666667 ns between samples, but if we upsample so that there are 416666 ns between samples, the new sample rate is 2400.003840006144, which is unsatisfying and may lead to floating point errors when searching/matching sample rates in tables. If instead we upsample to 400,000 ns per sample, we get 2500 Hz, so this would be a "friendly" sample_rate-sample_interval pair. This feels kind of hokey, but it seems like it would make the problem go away. ... of course there may be situations when one wants to sample at some multiple of 60 Hz to filter powerline noise... Related to MTH5 Issue 225 Stress Tests.
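A hypothetical helper for building such a table (integer target rates only; sub-Hz rates would need the reciprocal treatment): find the smallest rate at or above the native rate whose interval is an exact integer number of nanoseconds, i.e. an integer divisor of 10**9:

```python
import math

def nearest_friendly_rate(native_rate_hz):
    # step upward until the rate divides 10**9 = 2**9 * 5**9 exactly
    rate = math.ceil(native_rate_hz)
    while (10**9) % rate != 0:
        rate += 1
    return rate

print(nearest_friendly_rate(2400))   # 2500 (interval: 400000 ns)
print(nearest_friendly_rate(24000))  # 25000
```

This recovers the 2400 Hz -> 2500 Hz example from the bullet above, and could be tabulated once for the common MT native rates.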

kkappler (Dec 26 '24 17:12)