
[INFO] Can I use an sklearn pipeline that combines a model and feature selection?

Open guilhermeparreira opened this issue 1 year ago • 3 comments

What I want

Hi there!

I want to implement recursive feature elimination (RFE) for my sklearn models. It is part of the scikit-learn API (sklearn.feature_selection.RFE).

Can I build this as a scikit-learn pipeline and pass it into Darts?


guilhermeparreira avatar Jan 11 '24 21:01 guilhermeparreira

Hi @guilhermeparreira,

This should be possible, with the following considerations:

  • The regression model must be created with output_chunk_length=1. The underlying scikit-learn model, stored in the model attribute of the Darts regression model, can then be passed to RFE as the estimator.
  • The X and y arrays must be the tabularized data that RegressionModel._fit_model() creates with the self._create_lagged_data() method.

The output of RFE will be a bit difficult to interpret because the features are lags, so make sure to link them back to the lags used to create the Darts model.
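As a rough illustration of linking RFE output back to lags, here is a minimal sketch that does not use Darts at all. It assumes a synthetic series and a hand-rolled tabularization in which the columns are ordered from lag -12 to lag -1; the names series, lags and selected_lags are made up for this example, and the column order Darts produces should be checked separately.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic series (an assumption for this sketch): each value depends
# strongly on the value observed 12 steps earlier, plus some noise.
rng = np.random.default_rng(0)
n_lags = 12
series = np.zeros(400)
series[:n_lags] = rng.normal(size=n_lags)
for t in range(n_lags, len(series)):
    series[t] = 0.8 * series[t - n_lags] + rng.normal(scale=0.5)

# Hand-rolled tabularization: one column per lag, ordered -12, ..., -1.
lags = list(range(-n_lags, 0))
X = np.column_stack([series[n_lags + lag : len(series) + lag] for lag in lags])
y = series[n_lags:]

# Recursive feature elimination down to three lagged features.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

# support_ is a boolean mask over the columns, so it lines up with `lags`.
selected_lags = [lag for lag, keep in zip(lags, rfe.support_) if keep]
print(selected_lags)
```

Because the mask follows the column order, the mapping is only as reliable as your knowledge of how the feature matrix was built.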

madtoinou avatar Jan 12 '24 14:01 madtoinou

Thank you for the answer!

So, I can only use it with output_chunk_length=1, right?

Do you have an example of the steps you mentioned in the second bullet?

guilhermeparreira avatar Jan 16 '24 19:01 guilhermeparreira

Actually, you could probably also use output_chunk_length > 1 in combination with multi_models=False in order to have only one model, but keep in mind that the lags will be shifted (with respect to the corresponding position in the forecasted horizon; see the regression model example notebook). I am going to cover the simplest scenario (no covariates) in my example:

import numpy as np
from sklearn.feature_selection import RFE

from darts.models import LinearRegressionModel
from darts.datasets import AirPassengersDataset

ts = AirPassengersDataset().load()

model = LinearRegressionModel(lags=12, output_chunk_length=1)
X, y = model._create_lagged_data(target_series=ts, past_covariates=None, future_covariates=None, max_samples_per_ts=None)
rfe = RFE(estimator=model.model, n_features_to_select=3, step=1)
rfe.fit(X, y)

model_lags = model._get_lags('target')
# the best lags are -12, -2 and -1, matching expectations since there is a strong yearly seasonality
best_lags = [model_lags[idx] for idx in np.where(rfe.ranking_ == 1)[0]]

Note that if you use covariates, you would need to concatenate the lags when creating the model_lags variable. I will try to add this example and others to the RegressionModel example notebook as it might be useful for other users.
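For the covariates case, a minimal sketch of the bookkeeping could look like the following. It is plain Python with hypothetical helper names (label_lagged_features, selected_features), and it assumes the feature columns are laid out block by block in the order the dictionary is given, with one component per block; the actual column order in your Darts version should be verified before relying on it.

```python
from typing import Dict, List, Sequence


def label_lagged_features(lags_by_block: Dict[str, Sequence[int]]) -> List[str]:
    """Concatenate the lags of each block into one list of column labels.

    Assumes columns are ordered block by block, following the dict order.
    """
    labels: List[str] = []
    for block, lags in lags_by_block.items():
        labels.extend(f"{block}_lag{lag}" for lag in lags)
    return labels


def selected_features(labels: Sequence[str], ranking: Sequence[int]) -> List[str]:
    """Return the labels whose RFE ranking equals 1 (the retained features)."""
    if len(labels) != len(ranking):
        raise ValueError("labels and ranking must have the same length")
    return [label for label, rank in zip(labels, ranking) if rank == 1]


# Hypothetical lags for a target and one past covariate.
labels = label_lagged_features({"target": [-3, -2, -1], "past": [-2, -1]})
# Made-up ranking, standing in for rfe.ranking_ from a fitted RFE.
kept = selected_features(labels, [1, 3, 1, 2, 1])
print(kept)
```

The length check is there because a mismatch between the number of labels and the number of columns is the most likely sign that the assumed layout is wrong.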

madtoinou avatar Feb 09 '24 10:02 madtoinou