Orion Preprocessing non-contiguous segments

Currently, most pipelines share the same preprocessing primitives, applied in the following order:

1. mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate: makes the signal equi-spaced based on the specified interval.
2. sklearn.impute.SimpleImputer: imputes missing values.
3. sklearn.preprocessing.MinMaxScaler: normalizes the data to a specified range.
4. mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences: creates multiple training window examples based on the window_size and step_size.

However, we do not always want to make the signal equi-spaced; sometimes we would rather retain the gaps within the signal. For this task, there are two main considerations:

1. Normalize the data first to maintain the specified range.
2. Create segments based on the suggested max_gap, then apply primitives 1, 2, and 4 from the list above to each segment, and concatenate the results.

The sequence of preprocessing primitives would be:

"sklearn.preprocessing.MinMaxScaler",
"orion.primitives.timeseries_preprocessing.segment", # suggested
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
"sklearn.impute.SimpleImputer",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.timeseries_preprocessing.concatenate" # suggested
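
As a rough illustration of what the suggested segment step could do, here is a minimal sketch that splits a signal wherever two consecutive timestamps are more than max_gap apart. The function name segment_by_gap, the time_column argument, and the sample data are assumptions made for this example; this is not an existing Orion or MLPrimitives primitive.

```python
import pandas as pd


def segment_by_gap(df, max_gap, time_column="timestamp"):
    """Split df into contiguous segments (hypothetical helper, not an Orion primitive).

    A new segment starts at every row whose timestamp is more than
    max_gap away from the previous row's timestamp.
    """
    df = df.sort_values(time_column).reset_index(drop=True)
    # diff() is NaN for the first row, and NaN > max_gap is False,
    # so the first row always opens segment 0.
    starts_new_segment = df[time_column].diff() > max_gap
    segment_ids = starts_new_segment.cumsum()
    return [part.reset_index(drop=True) for _, part in df.groupby(segment_ids)]


# Illustrative data with two gaps larger than max_gap=30.
data = pd.DataFrame({
    "timestamp": [0, 10, 20, 100, 110, 300],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
segments = segment_by_gap(data, max_gap=30)
```

Each element of segments is then a gap-free dataframe that could be aggregated, imputed, and windowed on its own before the results are concatenated.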

Feb 02 '21 20:02 sarahmish

I don't see any activity here, but I'm wondering if this may have been addressed since Feb?

Oct 26 '21 14:10 kb1ooo

Hi @kb1ooo! It's still in the works.

Oct 29 '21 19:10 sarahmish

@sarahmish thanks. Is there some work on it checked into a branch?

Oct 29 '21 19:10 kb1ooo

There isn't an active branch for this. The primary change for this feature is in the rolling_window_sequences primitive. It currently works by slicing based on indexes. To make this change, we need to introduce slicing by timestamps and use a max_gap parameter to indicate the maximum allowed gap between one element and the next.
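
To make the idea concrete, here is a minimal sketch of windowing that checks timestamps and drops any window spanning a gap larger than max_gap. It is not the current rolling_window_sequences implementation: the function name and signature are assumptions for illustration, and targets are omitted to keep it short.

```python
import numpy as np


def rolling_windows_with_gaps(values, timestamps, window_size, step_size=1, max_gap=None):
    """Build rolling windows, skipping windows that cross a gap (hypothetical sketch)."""
    values = np.asarray(values)
    timestamps = np.asarray(timestamps)
    windows, starts = [], []
    for start in range(0, len(values) - window_size + 1, step_size):
        end = start + window_size
        # Distances between consecutive timestamps inside this window.
        gaps = np.diff(timestamps[start:end])
        if max_gap is not None and gaps.size and gaps.max() > max_gap:
            continue  # the window would straddle a gap, so drop it
        windows.append(values[start:end])
        starts.append(timestamps[start])
    if not windows:
        return np.empty((0, window_size) + values.shape[1:]), np.asarray(starts)
    return np.stack(windows), np.asarray(starts)


# Illustrative signal with a jump between timestamps 3 and 10.
values = np.arange(8)
timestamps = np.array([0, 1, 2, 3, 10, 11, 12, 13])
X, index = rolling_windows_with_gaps(values, timestamps, window_size=3, max_gap=1)
```

With this approach no explicit segment or concatenate step is needed, at the cost of inspecting the timestamps of every candidate window.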

Oct 30 '21 18:10 sarahmish

@sarahmish ok right. Is there a simpler intermediate version where the data is pre-segmented (i.e. don't delegate the segmentation logic to Orion, let it be the responsibility of the caller), and you would pass the data as, say, a list of dataframes instead of one dataframe? Then just iterate through the list, applying the same pipeline, and concatenate the rolling_window_sequences outputs.
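
A minimal sketch of that idea, assuming the caller has already split the signal into a list of gap-free arrays: windows are built inside each segment and stacked into one array. The helper windows_from_segments is hypothetical and only stands in for the windowing step; it is not part of Orion.

```python
import numpy as np


def windows_from_segments(segments, window_size, step_size=1):
    """Build rolling windows within each pre-segmented array and stack them (hypothetical helper)."""
    all_windows = []
    for segment in segments:
        segment = np.asarray(segment)
        # Segments shorter than window_size produce an empty range and contribute nothing.
        for start in range(0, len(segment) - window_size + 1, step_size):
            all_windows.append(segment[start:start + window_size])
    if not all_windows:
        return np.empty((0, window_size))
    return np.stack(all_windows)


# Three caller-provided segments; the middle one is too short for a window of 3.
segments = [
    np.array([0.0, 0.1, 0.2, 0.3]),
    np.array([0.5, 0.6]),
    np.array([0.7, 0.8, 0.9]),
]
X = windows_from_segments(segments, window_size=3)
```

Because no window crosses a segment boundary, the stacked array can be treated as a single training set.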

Nov 01 '21 22:11 kb1ooo

@kb1ooo that's definitely possible. Mechanically, you can just iterate over each dataframe calling orion.fit as a simple workaround. My only concern is that you will be training the ML model on epochs with different batches each time. I don't know how that will affect the learning of the underlying model.

Nov 08 '21 00:11 sarahmish
