strax
Improve splitting arrays of time intervals
What is the problem?
Currently a split is only allowed at times when no interval of data overlaps. This gets around the problem of a plugin missing data from one of its dependencies that overlaps with an interval of another dependency. However, this solution adds a lot of complexity when aligning chunks for plugins that take multiple inputs, as well as for windowing computations.
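To make the current rule concrete, here is a minimal sketch of the check it implies. It is not strax's actual implementation: the function name `can_split_at` and the plain `start`/`end` arrays are assumptions made for illustration, and intervals are treated as half-open `[start, end)`.

```python
import numpy as np


def can_split_at(start, end, t):
    """Return True if no half-open interval [start, end) straddles time t.

    Hypothetical helper: `start` and `end` are equal-length arrays of
    interval boundaries, not strax's real field names.
    """
    start = np.asarray(start)
    end = np.asarray(end)
    # An interval straddles t if it begins before t and ends after t.
    straddles = (start < t) & (end > t)
    return not bool(np.any(straddles))


# Three intervals: [0, 5), [4, 10), [12, 15)
starts = [0, 4, 12]
ends = [5, 10, 15]
print(can_split_at(starts, ends, 10))  # gap between 10 and 12, split allowed
print(can_split_at(starts, ends, 4))   # [0, 5) straddles 4, split refused
```

With several dependencies, every input has to pass this check at the same time, which is where the alignment complexity described above comes from.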
Proposed solution
A possible solution would be to split inclusively and concatenate exclusively. The rule for splitting at any given time is to include overlapping intervals on both sides of the split. When concatenating two datasets, intervals are only taken from each chunk if they start within that chunk's half-open interval of validity. Intervals that overlap the split time will therefore be processed twice, but if the chunk size is reasonable, the effect of one additional row on compute time should be negligible.

This approach would eliminate the need for a special plugin type for windowing operations, since all plugins can potentially compute overlapping chunks. Each plugin can define how much overlap it wants on each side, and the overlapping chunks would be processed in parallel. The potential extra overlap in the output would only be stripped when concatenating two adjacent chunks.

Chunks include "start" and "end" fields that define the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk. This selection can be done when adjacent chunks are collected into local memory for the next step of processing, to ensure that all data is included in at least one chunk.
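The proposal can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed strax API: the names `split_inclusive`, `trim_to_validity` and `concat_exclusive` are hypothetical, intervals are stored in a NumPy structured array with assumed `start` and `end` fields, and each chunk is represented as a `(data, valid_start, valid_end)` tuple.

```python
import numpy as np

# Assumed layout for an array of time intervals (hypothetical field names).
interval_dtype = np.dtype([("start", np.int64), ("end", np.int64)])


def split_inclusive(data, t):
    """Split at time t, keeping intervals that overlap t on BOTH sides."""
    left = data[data["start"] < t]   # everything that begins before t
    right = data[data["end"] > t]    # everything that ends after t
    return left, right


def trim_to_validity(data, valid_start, valid_end):
    """Keep only intervals that start inside [valid_start, valid_end)."""
    keep = (data["start"] >= valid_start) & (data["start"] < valid_end)
    return data[keep]


def concat_exclusive(chunks):
    """Concatenate (data, valid_start, valid_end) chunks without duplicates."""
    return np.concatenate([trim_to_validity(d, s, e) for d, s, e in chunks])


data = np.array([(0, 5), (4, 10), (8, 12), (12, 15)], dtype=interval_dtype)

# Splitting at t=9 puts (4, 10) and (8, 12) into both halves.
left, right = split_inclusive(data, 9)

# Each half is valid on a half-open range; duplicates vanish on concatenation.
merged = concat_exclusive([(left, 0, 9), (right, 9, 16)])
print(len(left), len(right), len(merged))
```

Because an interval is assigned to exactly one chunk by its start time, every interval survives concatenation once, no matter how much overlap each plugin requested.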