Orion
Preprocessing non-contiguous segments
Currently most pipelines share the same preprocessing primitives, applied in the following order:
1. mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate: makes the signal equi-spaced based on the specified interval.
2. sklearn.impute.SimpleImputer: imputes missing values.
3. sklearn.preprocessing.MinMaxScaler: normalizes the data to a specified range.
4. mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences: creates multiple training window examples based on the window_size and step_size.
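To illustrate why equi-spacing can be undesirable when the signal has long gaps, here is a minimal sketch of the first two steps. It is not the mlprimitives or scikit-learn implementation: aggregate_equispaced and impute_mean are hypothetical stand-ins written in plain Python, and integer timestamps and mean imputation are assumptions.

```python
from statistics import mean


def aggregate_equispaced(timestamps, values, interval):
    """Average values into fixed-width buckets; empty buckets become None.

    Hypothetical stand-in for an aggregation step, assuming sorted
    integer timestamps.
    """
    start = timestamps[0]
    n_buckets = (timestamps[-1] - start) // interval + 1
    buckets = [[] for _ in range(n_buckets)]
    for t, v in zip(timestamps, values):
        buckets[(t - start) // interval].append(v)
    index = [start + i * interval for i in range(n_buckets)]
    return index, [mean(b) if b else None for b in buckets]


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# A signal with a long gap between t=2 and t=10.
timestamps = [0, 1, 2, 10, 11]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

index, aggregated = aggregate_equispaced(timestamps, values, interval=1)
imputed = impute_mean(aggregated)
```

In this toy signal, 7 of the 12 equi-spaced points are imputed rather than observed, which is the situation the proposal tries to avoid.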
However, we do not always want to make the signal equi-spaced; sometimes we want to retain the gaps within the signal. For this task, there are two main considerations:
- normalize the data first, so that the specified range is maintained across the whole signal.
- create segments based on the suggested max_gap, then apply primitives 1, 2 & 4 shown above to each segment, and concatenate the results.
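The two considerations above can be sketched as follows. This is only an illustration in plain Python, not an existing Orion primitive: min_max_scale and segment are hypothetical helpers, and the sketch assumes sorted timestamps and a non-empty signal.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Scale values into feature_range using the global min and max."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    return [a + (v - lo) * (b - a) / span for v in values]


def segment(timestamps, values, max_gap):
    """Split the signal wherever consecutive timestamps differ by more than max_gap."""
    segments = []
    current = [(timestamps[0], values[0])]
    for prev_t, t, v in zip(timestamps, timestamps[1:], values[1:]):
        if t - prev_t > max_gap:
            segments.append(current)
            current = []
        current.append((t, v))
    segments.append(current)
    return segments


timestamps = [0, 1, 2, 10, 11]
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Normalize first so every segment shares the same scale, then split.
scaled = min_max_scale(values)
segments = segment(timestamps, scaled, max_gap=3)
```

Scaling before segmenting means each segment is expressed in the same range; scaling each segment separately would map different raw values to the same normalized value.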
The sequence of preprocessing primitives would be:
"sklearn.preprocessing.MinMaxScaler",
"orion.primitives.timeseries_preprocessing.segment", # suggested
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
"sklearn.impute.SimpleImputer",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.timeseries_preprocessing.concatenate" # suggested
I don't see any activity here, but I'm wondering if this may have been addressed since Feb?
Hi @kb1ooo! It's still in the works.
@sarahmish thanks. Is there some work on it checked into a branch?
There isn't an active branch for this. The primary change for this feature is in the rolling_window_sequences primitive. It currently works by slicing based on indexes. To make this change, we need to introduce slicing by timestamps and use a max_gap parameter to indicate the maximum gap allowed between one element and the next.
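One way the timestamp-aware slicing described above might look is sketched below. It is an assumption about the approach, not the actual rolling_window_sequences code: the function name and signature are hypothetical, and it only builds input windows.

```python
def rolling_windows_with_max_gap(timestamps, values, window_size, step_size, max_gap):
    """Build rolling windows, skipping any window that spans a gap larger than max_gap.

    Hypothetical sketch; assumes timestamps are sorted and aligned with values.
    """
    windows, starts = [], []
    for start in range(0, len(values) - window_size + 1, step_size):
        ts = timestamps[start:start + window_size]
        # Keep the window only if every consecutive pair is close enough in time.
        if all(b - a <= max_gap for a, b in zip(ts, ts[1:])):
            windows.append(values[start:start + window_size])
            starts.append(ts[0])
    return windows, starts


timestamps = [0, 1, 2, 10, 11, 12]
values = [1, 2, 3, 4, 5, 6]

windows, starts = rolling_windows_with_max_gap(
    timestamps, values, window_size=2, step_size=1, max_gap=1
)
```

Here the window covering timestamps 2 and 10 is dropped, so no training example mixes samples from both sides of the gap.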
@sarahmish ok right. Is there a simpler intermediate version where basically the data is pre-segmented (i.e. don't delegate the segmentation logic to orion, let it be the responsibility of the caller), and you would pass the data as say a list of dataframes instead of one dataframe? Then just iterate through the list, applying the same pipeline, and concatenate the rolling_window_sequences.
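A minimal sketch of that caller-side idea, assuming the data has already been split and using plain lists in place of dataframes. Both functions are hypothetical helpers, not part of Orion or mlprimitives.

```python
def rolling_windows(values, window_size, step_size):
    """Index-based rolling windows over a single contiguous segment."""
    return [
        values[i:i + window_size]
        for i in range(0, len(values) - window_size + 1, step_size)
    ]


def windows_from_segments(segments, window_size, step_size):
    """Window each pre-segmented chunk separately and concatenate the results."""
    all_windows = []
    for seg in segments:
        # Segments shorter than window_size contribute no windows.
        all_windows.extend(rolling_windows(seg, window_size, step_size))
    return all_windows


# The caller is responsible for the segmentation.
segments = [[1, 2, 3], [4, 5], [6]]
windows = windows_from_segments(segments, window_size=2, step_size=1)
```

Because each segment is windowed on its own, no window crosses a segment boundary, and the concatenated result can be treated as a single training set.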
@kb1ooo that's definitely possible. Mechanically, you can just iterate over each dataframe calling orion.fit as a simple workaround. My only concern is that you will be training the ML model on epochs with different batches each time. I don't know how that will affect the learning of the underlying model.