Add window features for RegressionModels

Open hrzn opened this issue 2 years ago • 12 comments

In addition to the "lag features" we already have, it'd be nice to add "window features": specifying window characteristics and the corresponding function(s) to apply, so that features are created dynamically in regression models. For instance, it is often helpful to use the trailing mean and variance of the last N points as features. We could also imagine having a way to define fairly generic windows (e.g., "last month", "last week", "the N points starting N-k time steps ago", etc.).
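As a rough illustration of the trailing mean/variance idea (plain pandas, not Darts code; the toy values, the window size N = 3, and the column names are assumptions made up for this sketch):

```python
import pandas as pd

# Toy target series; the values are invented for illustration.
series = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0], name="target")

N = 3  # hypothetical window size

# Shift by one step so the window at time t only covers t-N..t-1,
# i.e. the value being predicted never leaks into its own features.
past = series.shift(1)

features = pd.DataFrame(
    {
        "target": series,
        f"trailing_mean_{N}": past.rolling(N).mean(),
        f"trailing_var_{N}": past.rolling(N).var(),  # sample variance (ddof=1)
    }
)
```

The first N rows have no complete window and come out as NaN, so a model would have to drop or impute them.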

hrzn avatar Jul 16 '22 13:07 hrzn

@hrzn Hi! Will have a go at this one instead 😄

adamkells avatar Aug 15 '22 15:08 adamkells

That'd be awesome!! You can take a look at this talk for a nice overview. In Darts the RegressionModel class would be the place to start. Let us know if there are some design decisions you'd like to discuss. The most important thing will be to get the API right and keep it as simple as possible.

hrzn avatar Aug 15 '22 15:08 hrzn

@adamkells are you working on this one?

hrzn avatar Aug 23 '22 09:08 hrzn

@hrzn Planning to work on this on Friday. Watched the talk and want to get some thoughts on scope and design.

I think we probably want to add the functionality in one of two ways:

Option 1: Something analogous to the way lags are currently handled. So adding as function inputs:

  1. windows: (list of integers specifying window sizes for target column)
  2. windows_past_covariates: (list of integers specifying window sizes for covariates)
  3. window_functions: (list containing strings specifying possible windowing functions mean, z_score, ewma etc.)

Option 2: Just having a single nested input dictionary:

{'target': {'function': 'ewma', 'window_size': 5},
 'covariate_1': {'function': 'mean', 'window_size': 10}}

What do you think?

adamkells avatar Aug 23 '22 10:08 adamkells

awesome :)

I think I'm more in favour of Option 2, mostly in order to keep the API not too cluttered, and keep some flexibility in the structure of this dictionary without requiring further adaptations to the call signature. We also already have such a dict for the add_encoders parameter (see docs).

I think your example looks quite good - some notes:

  • I think we should probably also support windows on future covariates (not only target and past covariates)
  • For this reason, although it could be quite powerful to be able to specify per-covariate-dimension windows, I think it'd already be very nice to have the same window applied to all components of {past, future}_covariates. You could also try to do it fancy directly, but I expect a bit of complexity, for instance to handle the case where the target and the past (or future) covariate series share components with the same names. One way could be to make it look like this:
{
 'target': {'function': 'ewma', 'window_size': 5},
 'future': {'all': [{'function': 'mean', 'window_size': 10}]},
 'past': {'component1': [{'function': 'ewma', 'window_size': 5}]}
 }

but it is slightly more complex...

  • We probably need to accept a list of functions (to add potentially several windows)
  • It'd be nice to accept an actual Python function (e.g., a lambda) in addition to a name - something applying on a DataFrame. This way users can specify their own windowing functions :)
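
A minimal sketch of how "a name or an actual Python function" could be resolved, using plain pandas. The helper `apply_window` and the `NAMED_FUNCTIONS` whitelist are hypothetical names for this illustration, not part of Darts, and the callable here receives each column's window as a NumPy array rather than a full DataFrame:

```python
import pandas as pd

# Names accepted as strings in this sketch; each maps to a pandas Rolling method.
NAMED_FUNCTIONS = {"mean", "std", "var", "min", "max", "sum", "median"}


def apply_window(df, function, window_size):
    """Apply a named or user-provided window function to every column of df."""
    rolling = df.rolling(window_size)
    if callable(function):
        # raw=True hands each window to the callable as a NumPy array.
        return rolling.apply(function, raw=True)
    if function not in NAMED_FUNCTIONS:
        raise ValueError(f"Unknown window function: {function!r}")
    return getattr(rolling, function)()


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
by_name = apply_window(df, "mean", 2)
by_callable = apply_window(df, lambda w: w.max() - w.min(), 2)
```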

@dennisbader @piaz97, wdyt?

hrzn avatar Aug 23 '22 13:08 hrzn

Update on this:

  1. Dictionary format:
    • I think the best format to avoid overly nested dictionaries is to allow keywords of ['target', 'future' and 'past']. I'd prefer to keep specific variables transformations as a separate PR to reduce the scope of this piece.
    • The value for each key can then be either a dictionary defining the function to be applied or a list of dictionaries for multiple functions.
    • Each value dictionary can have the function to be used and all the parameters to be passed to the function.
{
'target': {'function': user_function, 'param_1': 1, 'param_2': 2},
'future': {'function': 'mean', 'window_size': 10},
'past': [{'function': 'ewma', 'window_size': 5},
         {'function': user_function, 'param_1': 1, 'param_2': 2}]
}
  2. Available functions:
    • There is a little bit of awkwardness around functions that require aggregation. For example, for an ewma transformation we need to apply ewm(window_size).mean(), which cannot easily be passed to the dictionary without wrapping it inside a custom function.
    • One option would be to allow an aggregation parameter in the dictionary which can take values such as rolling or ewm.
    • My preference is to have a list of common use cases coded so users can pass 'ewma' or 'rolling_mean' as strings. Then when a user has a use case that falls outside these predefined cases, to advise them to write their own function to pass to the dictionary.
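
To make the "predefined names plus custom functions" idea concrete, here is a rough sketch in plain pandas. The names `PREDEFINED` and `build_features`, the output column naming, and the choice to interpret window_size as the EWM span are all assumptions of this sketch, not the draft PR's implementation:

```python
import pandas as pd


def rolling_mean(df, window_size):
    return df.rolling(window_size).mean()


def ewma(df, window_size):
    # Treating window_size as the EWM span is an assumption of this sketch.
    return df.ewm(span=window_size).mean()


# Common use cases that users can refer to by name.
PREDEFINED = {"rolling_mean": rolling_mean, "ewma": ewma}


def build_features(df, specs):
    """specs is a single dict or a list of dicts, as in the format above."""
    if isinstance(specs, dict):
        specs = [specs]
    outputs = []
    for i, spec in enumerate(specs):
        params = dict(spec)  # copy so the caller's dict is left untouched
        func = params.pop("function")
        if isinstance(func, str):
            name, func = func, PREDEFINED[func]
        else:
            name = getattr(func, "__name__", "custom")
        # Remaining keys are forwarded to the function as keyword arguments.
        outputs.append(func(df, **params).add_prefix(f"{name}_{i}_"))
    return pd.concat(outputs, axis=1)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
feats = build_features(
    df,
    [
        {"function": "rolling_mean", "window_size": 2},
        {"function": "ewma", "window_size": 3},
    ],
)
```

A user function passed in place of a name would need the same `(df, **params)` shape for this particular sketch to work.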

adamkells avatar Aug 30 '22 09:08 adamkells

> Update on this:
>
>   1. Dictionary format:
>
>     • I think the best format to avoid overly nested dictionaries is to allow keywords of ['target', 'future' and 'past']. I'd prefer to keep specific variables transformations as a separate PR to reduce the scope of this piece.
>     • The value for each key can then be either a dictionary defining the function to be applied or a list of dictionaries for multiple functions.
>     • Each value dictionary can have the function to be used and all the parameters to be passed to the function.
> {
> 'target': {'function': user_function, 'param_1': 1, 'param_2': 2},
> 'future': {'function': 'mean', 'window_size': 10},
> 'past': [{'function': 'ewma', 'window_size': 5},
>          {'function': user_function, 'param_1': 1, 'param_2': 2}]
> }

Sounds great. You could even make it simpler and not support specifying the parameters (param_1 etc.) in the function specification. We can assume that the provided function always works on a window dataframe and does not need extra parameters (which I think would be easy for users to manage, using partial functions for instance). That would also avoid awkward cases where a user-provided function has an argument named window_size.
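
A small sketch of the partial-function approach, using only the standard library. `clipped_mean` is a made-up user function and `rolling_apply` is a toy stand-in for the library side; both are assumptions for illustration only:

```python
from functools import partial
from statistics import mean


def clipped_mean(window, lower, upper):
    """Hypothetical user function: mean of the window after clipping to [lower, upper]."""
    return mean(min(max(x, lower), upper) for x in window)


def rolling_apply(values, window_size, func):
    """Toy stand-in for the library side: calls func(window) with no extra arguments."""
    return [
        func(values[i - window_size + 1 : i + 1]) if i >= window_size - 1 else None
        for i in range(len(values))
    ]


# The extra parameters are bound up front, so the spec only carries
# the function and the window size.
spec = {"function": partial(clipped_mean, lower=0.0, upper=10.0), "window_size": 3}
features = rolling_apply([1.0, 2.0, 100.0, 4.0], spec["window_size"], spec["function"])
```

Since the library only ever calls the function with the window, a user parameter that happens to be named window_size cannot clash with the spec's own key.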

>   • My preference is to have a list of common use cases coded so users can pass 'ewma' or 'rolling_mean' as strings. Then when a user has a use case that falls outside these predefined cases, to advise them to write their own function to pass to the dictionary.

Agree, sounds good 👍

hrzn avatar Aug 30 '22 13:08 hrzn

Hi @adamkells, are you getting a chance to work on this issue? I'm asking because it is quite key for our roadmap. There is no rush, but if you are unsure, we can maybe take it up. Let us know :)

hrzn avatar Sep 06 '22 12:09 hrzn

Hi @hrzn, sorry about the delay. I've set time aside to do open-source work every second Friday, so I will get a run at it tomorrow. I'm happy to hand it over after tomorrow if you want to finish it off; at any rate, I could probably use a bit of help with the testing etc.

adamkells avatar Sep 06 '22 13:09 adamkells

No worries @adamkells, we really appreciate your efforts. It would be great if after next Friday you could maybe open a draft PR of what you have so far, and we can work collaboratively on this from there. Thanks!

hrzn avatar Sep 06 '22 15:09 hrzn

@hrzn Have opened a draft PR with the work so far. Apologies it's a bit of a mess, let me know how I can help improve it.

adamkells avatar Sep 09 '22 16:09 adamkells

Thanks @adamkells, I'll try to look at it sometime soon

hrzn avatar Sep 12 '22 09:09 hrzn