scikit-lego
[FEATURE] Time Series Target Encoding Transformer
Hi all,
I am a data scientist working mainly on time series problems. Usually, the best features are lags, rolling means and target encodings. I already have a transformer for Spark DataFrames that I use in my daily work for creating these features.
I would like to contribute a time series target encoding transformer (which can be used to create lags, rolling means and target encodings) for pandas as well.
For instance, I use the following class to create rolling means at the item, store and region levels with rolling windows of 13, 52 and 104 weeks, plus some skip periods to prevent data leakage.
The transformer I created was designed around Spark's Window functionality and is meant to be used in the preprocessing step. However, I am willing to create one for scikit-learn pipelines.
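To make the idea concrete, here is a rough pandas sketch of the kind of feature computation I mean. The column names (`date`, `store`, `item`, `sales`) and the helper itself are illustrative only, not the actual Spark implementation:

```python
import pandas as pd

def add_rolling_means(df, group_cols, target_col, windows, skip_periods):
    """Add lagged rolling means per group; skip_periods shifts the window
    back in time so it never overlaps the rows being predicted."""
    df = df.sort_values("date").copy()
    grouped = df.groupby(group_cols)[target_col]
    for window in windows:
        col = f"{target_col}_roll_mean_{window}"
        df[col] = grouped.transform(
            lambda s: s.shift(skip_periods).rolling(window, min_periods=1).mean()
        )
    return df

# e.g. weekly sales, rolling means over 13/52/104 weeks, with a 2-week gap
# features = add_rolling_means(sales_df, ["store", "item"], "sales",
#                              windows=[13, 52, 104], skip_periods=2)
```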
If you also think this is a good idea, I would be happy to discuss the implementation.
Best,
I'm interested in discussing this in more detail, but let's first go over a few nitpicks.
- In the future, please use the code feature of markdown to share code, not a screenshot. A screenshot does not allow one to search or copy/paste.
- Is a `TargetEncoder` the best name for such a component? It feels like `LaggedFeatureEncoder` or `LaggedFeaturizer` might be more specific to what the component does.
- Does the component output a pandas dataframe? Is that required? Would it be better to just output a numpy array instead? I can imagine that the manual labor of getting `output_cols` aligned with the other params would be a source of human error.
- What does `skip_periods` do?
- Is it possible to make sure that the conventions in this tool are in sync with the RepeatingBasisFunctionTransformer?
- Is scikit-lego the best place for this transformer? Wouldn't it perhaps be better to add this to a tool that is specific to time series?
Thanks a lot for your answer, Vincent. I am also a data scientist based in Amsterdam and I really appreciate the work you did with this package.
Below I answered your questions. However, I want to ask a question first:
1- Calculating lags and rolling statistics is quite easy on the whole dataset. To be able to use it in pipelines, the way I thought about it is:
- Fit: we need to estimate these statistics for both training and test dates
- Transform: we need to make a join (by date and any other grouping cols) in order to attach these statistics to the training and test dates. That makes it a bit cumbersome to use in pipelines. I really wonder about your view on this. Do you think this should still happen inside the pipeline, or should it belong to some separate preprocessing step?
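A rough sketch of what I mean by this fit/transform split, using the `LaggedFeaturizer` name suggested above (the parameters and internals are illustrative only):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LaggedFeaturizer(BaseEstimator, TransformerMixin):
    """Sketch only: fit precomputes lagged rolling statistics per group/date,
    transform joins them onto whatever rows it receives."""

    def __init__(self, date_col, group_cols, target_col, window, skip_periods):
        self.date_col = date_col
        self.group_cols = group_cols
        self.target_col = target_col
        self.window = window
        self.skip_periods = skip_periods

    def fit(self, X, y=None):
        # X would need to contain rows for the test dates as well (with the
        # target missing there), which is what makes this awkward in a pipeline
        df = X.sort_values(self.date_col).copy()
        stat_col = f"{self.target_col}_roll_mean_{self.window}"
        df[stat_col] = df.groupby(self.group_cols)[self.target_col].transform(
            lambda s: s.shift(self.skip_periods)
                       .rolling(self.window, min_periods=1)
                       .mean()
        )
        # assumes one row per (group, date) combination
        self.stats_ = df[self.group_cols + [self.date_col, stat_col]]
        return self

    def transform(self, X):
        # the join by date and grouping columns described above
        return X.merge(self.stats_, on=self.group_cols + [self.date_col], how="left")
```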
Answers to your questions:

1- Yes, I will do so from now on. Thank you for the suggestion.
2- I thought this transformer could be used to calculate lags, rolling statistics and target encodings. Although target encodings are widely known in the industry, people may not expect to be able to compute lags and rolling statistics with this transformer, so I agree a better name might be required.
3- I don't think this is a requirement. I just thought that calculating those features would be easier with pandas. However, numpy could be done, I guess.
4- It lags the data before calculating the statistics. For example, if I am training a model that will make Week+2 predictions, then I need to calculate all of these statistics with a gap of 2 weeks to prevent data leakage (see the sketch below).
5- Could you elaborate on the specific conventions used in the RepeatingBasisFunctionTransformer?
6- I believe lags/rolling statistics are very fundamental features that can be used in any problem that has a time dimension. I believe this transformer would work well in combination with TimeGapSplit.
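To illustrate answer 4, a toy example of the 2-week gap (the numbers are made up):

```python
import pandas as pd

sales = pd.Series([10, 12, 11, 13, 15, 14], name="sales")

# Week+2 forecast: shift by 2 so the statistic at week t only uses data
# up to week t-2, then take a 3-week rolling mean
gapped_mean = sales.shift(2).rolling(3, min_periods=1).mean()
# -> NaN, NaN, 10.0, 11.0, 11.0, 12.0
```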
Is it like `CatBoostEncoder` with `has_time=True`?
Edit: `CatBoostEncoder` uses cumulative statistics, instead of lags or window functions.
As far as I know, CatBoost uses cross validation and regularization to encode high cardinality categorical variables while preventing data leakage. The purpose is the same, but here there is no need for cross validation and regularization because we already have a time dimension. We just need to respect the time dimension to prevent data leakage.
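To illustrate the difference with a toy example (made-up numbers, and not CatBoost's actual implementation): an ordered/cumulative encoding uses an expanding mean of everything before each row, while the transformer discussed here would use lagged rolling windows. Both respect the time dimension.

```python
import pandas as pd

y = pd.Series([10, 12, 11, 13, 15, 14])

# cumulative / ordered statistic: expanding mean of all earlier observations
cumulative = y.shift(1).expanding().mean()

# windowed statistic with a gap: fixed 3-step window ending 2 steps before t
windowed = y.shift(2).rolling(3, min_periods=1).mean()

print(pd.DataFrame({"y": y, "cumulative": cumulative, "windowed": windowed}))
```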