Preprocessing non-contiguous segments

Open sarahmish opened this issue 4 years ago • 6 comments

Currently, most pipelines share the same preprocessing primitives, applied in the following order:

  1. mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate: makes the signal equi-spaced based on the specified interval.

  2. sklearn.impute.SimpleImputer: imputes missing values.

  3. sklearn.preprocessing.MinMaxScaler: normalizes the data to a specified range.

  4. mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences: creates multiple training window examples based on the window_size and step_size.

However, we do not always want to make the signal equi-spaced; sometimes we would rather retain the gaps within the signal. For this task, two things need to happen:

  1. Normalize the data first to maintain the specified range across all segments.
  2. Create segments based on the suggested max_gap, apply primitives 1, 2 and 4 shown above to each segment, then concatenate the results.

The sequence of preprocessing primitives would be:

"sklearn.preprocessing.MinMaxScaler",
"orion.primitives.timeseries_preprocessing.segment", # suggested
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
"sklearn.impute.SimpleImputer",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.timeseries_preprocessing.concatenate" # suggested
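
As a rough illustration of what the two suggested primitives might do, here is a minimal sketch in plain pandas and NumPy. The function names `segment` and `concatenate`, the `timestamp` and `value` column names, and the `time_column` argument are assumptions made for this example only; they do not describe an existing Orion or mlprimitives API.

```python
import numpy as np
import pandas as pd


def segment(data, max_gap, time_column="timestamp"):
    """Split `data` wherever consecutive timestamps differ by more than `max_gap`.

    Hypothetical helper: name and signature are assumptions, not an Orion API.
    """
    data = data.sort_values(time_column).reset_index(drop=True)
    # True on every row whose gap to the previous row exceeds max_gap.
    # The first row has no predecessor (NaN diff), so it compares as False.
    new_segment = data[time_column].diff() > max_gap
    # The cumulative sum turns those flags into one segment id per row.
    segment_ids = new_segment.cumsum()
    return [part.reset_index(drop=True) for _, part in data.groupby(segment_ids)]


def concatenate(arrays):
    """Stack per-segment outputs (for example, window arrays) along the first axis."""
    return np.concatenate(arrays, axis=0)


df = pd.DataFrame({
    "timestamp": [0, 10, 20, 100, 110, 300],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
segments = segment(df, max_gap=30)
```

With `max_gap=30`, the example signal is cut at the jumps from 20 to 100 and from 110 to 300. Each resulting dataframe could then be passed through the aggregation, imputation and windowing steps on its own before the outputs are stacked.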

sarahmish avatar Feb 02 '21 20:02 sarahmish

I don't see any activity here, but I'm wondering if this may have been addressed since Feb?

kb1ooo avatar Oct 26 '21 14:10 kb1ooo

Hi @kb1ooo! It's still in the works.

sarahmish avatar Oct 29 '21 19:10 sarahmish

@sarahmish thanks. Is there some work on it checked into a branch?

kb1ooo avatar Oct 29 '21 19:10 kb1ooo

There isn't an active branch for this issue. The primary change for this feature is in the rolling_window_sequences primitive, which currently works by slicing based on indexes. To make this change, we need to introduce slicing by timestamps and use a max_gap parameter to indicate the maximum gap allowed between one element and the next.
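
To make the idea concrete, here is a minimal standalone sketch of gap-aware windowing. It is not the actual rolling_window_sequences implementation: the function name `rolling_windows_with_max_gap` and its signature are assumptions for illustration, and it builds input windows only (no targets). It still slices by index, but it checks the timestamps inside each candidate window and drops any window that spans a gap larger than max_gap.

```python
import numpy as np


def rolling_windows_with_max_gap(values, timestamps, window_size, max_gap, step_size=1):
    """Build fixed-length windows, skipping any window that crosses a large gap.

    Hypothetical helper for illustration; not part of mlprimitives or Orion.
    """
    values = np.asarray(values)
    timestamps = np.asarray(timestamps)
    windows, starts = [], []
    for start in range(0, len(values) - window_size + 1, step_size):
        end = start + window_size
        # Differences between consecutive timestamps inside this window.
        gaps = np.diff(timestamps[start:end])
        if np.all(gaps <= max_gap):
            windows.append(values[start:end])
            starts.append(timestamps[start])
    return np.asarray(windows), np.asarray(starts)


timestamps = [0, 10, 20, 100, 110, 120, 130]
values = [1, 2, 3, 4, 5, 6, 7]
X, index = rolling_windows_with_max_gap(values, timestamps, window_size=3, max_gap=30)
```

In this example the windows starting at timestamps 10 and 20 are discarded because they straddle the jump from 20 to 100, so only windows that lie entirely inside one contiguous stretch remain.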

sarahmish avatar Oct 30 '21 18:10 sarahmish

@sarahmish ok, right. Is there a simpler intermediate version where the data is pre-segmented (i.e. don't delegate the segmentation logic to orion; let it be the responsibility of the caller) and you would pass the data as, say, a list of dataframes instead of one dataframe? Then just iterate through the list, applying the same pipeline, and concatenate the rolling_window_sequences outputs.
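
A minimal sketch of that intermediate version, assuming each dataframe in the list is already contiguous and has a `value` column. The helpers `make_windows` and `windows_from_segments` are hypothetical names invented for this example and stand in for the real pipeline steps.

```python
import numpy as np
import pandas as pd


def make_windows(values, window_size, step_size=1):
    """Index-based rolling windows over a single contiguous segment."""
    values = np.asarray(values)
    count = len(values) - window_size + 1
    if count <= 0:
        # Segment is shorter than one window: contribute no rows.
        return np.empty((0, window_size) + values.shape[1:], dtype=values.dtype)
    return np.stack([values[i:i + window_size] for i in range(0, count, step_size)])


def windows_from_segments(frames, window_size, value_column="value"):
    """Window each pre-segmented dataframe separately, then stack the results."""
    per_segment = [make_windows(f[value_column].to_numpy(), window_size) for f in frames]
    return np.concatenate(per_segment, axis=0)


frames = [
    pd.DataFrame({"timestamp": [0, 10, 20, 30], "value": [1.0, 2.0, 3.0, 4.0]}),
    pd.DataFrame({"timestamp": [100, 110], "value": [5.0, 6.0]}),  # too short for a window
    pd.DataFrame({"timestamp": [300, 310, 320], "value": [7.0, 8.0, 9.0]}),
]
X = windows_from_segments(frames, window_size=3)
```

Because windows are built per dataframe, none of them crosses a segment boundary, and the stacked array can be handed to the model in a single training call.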

kb1ooo avatar Nov 01 '21 22:11 kb1ooo

@kb1ooo that's definitely possible. Mechanically, you can just iterate over each dataframe and call orion.fit as a simple workaround. My only concern is that you would be training the ML model on epochs with different batches each time, and I don't know how that would affect the learning of the underlying model.

sarahmish avatar Nov 08 '21 00:11 sarahmish