scikit-lego
[FEATURE] Time Series Target Encoding Transformer
Hi all,
I am a data scientist working mainly on time series problems. Usually, the best features are lags, rolling means and target encodings. I already have a transformer for Spark DataFrames that I use in my daily work for creating these features.
I would like to contribute a time series target encoding transformer (which can be used to create lags, rolling means and target encodings) for pandas as well.
For instance, I use the following class to create rolling means at the item, store and region levels with rolling windows of 13, 52 and 104 weeks, plus some skip periods to prevent data leakage.
The transformer I created was designed around Spark's Window functionality and is meant to be used in the preprocessing step. However, I am willing to create one for scikit-learn pipelines.
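To make the idea concrete, here is a rough pandas sketch of the kind of feature computation I mean. The column names (`date`, `store`, `item`, `sales`) and the helper itself are illustrative only, not the actual Spark implementation:

```python
import pandas as pd

def add_rolling_means(df, group_cols, target_col, windows, skip_periods):
    """Add lagged rolling means per group; skip_periods shifts the window
    back in time so it never overlaps the rows being predicted."""
    df = df.sort_values("date").copy()
    grouped = df.groupby(group_cols)[target_col]
    for window in windows:
        col = f"{target_col}_roll_mean_{window}"
        df[col] = grouped.transform(
            lambda s: s.shift(skip_periods).rolling(window, min_periods=1).mean()
        )
    return df

# e.g. weekly sales, rolling means over 13/52/104 weeks, with a 2-week gap
# features = add_rolling_means(sales_df, ["store", "item"], "sales",
#                              windows=[13, 52, 104], skip_periods=2)
```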
If you also think this is a good idea, I would be happy to discuss the implementation.
Best,
I'm interested in discussing this in more detail, but let's first go over a few nitpicks.
- In the future, please use the code feature of markdown to share code, not a screenshot. A screenshot does not allow one to search or copy/paste.
- Is a `TargetEncoder` the best name for such a component? It feels like `LaggedFeatureEncoder` or `LaggedFeaturizer` might be more specific to what the component does.
- Does the component output a pandas dataframe? Is that required? Would it be better to just output a numpy array instead? I can imagine that the manual labor of getting `output_cols` aligned with the other params would be a source of human error.
- What does `skip_periods` do?
- Is it possible to make sure that the conventions in this tool are in sync with the RepeatingBasisFunctionTransformer?
- Is scikit-lego the best place for this transformer? Wouldn't it perhaps be better to add this to a tool that is specific to time series?
Thanks a lot for your answer, Vincent. I am also a data scientist based in Amsterdam and I really appreciate the work you did with this package.
Below I answered your questions. However, I want to ask a question first:
1- Calculating lags and rolling statistics is quite easy on the whole dataset. To be able to use it in pipelines, the way I thought about it is:
- Fit: we need to estimate these statistics for both training and test dates
- Transform: we need to make a join (by date and any other grouping cols) in order to attach these statistics to the training and test dates. That makes it a bit cumbersome to use in pipelines. I really wonder about your view on this. Do you think this should still happen inside the pipeline, or should it belong to some separate preprocessing step?
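A rough sketch of what I mean by this fit/transform split, using the `LaggedFeaturizer` name suggested above (the parameters and internals are illustrative only):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LaggedFeaturizer(BaseEstimator, TransformerMixin):
    """Sketch only: fit precomputes lagged rolling statistics per group/date,
    transform joins them onto whatever rows it receives."""

    def __init__(self, date_col, group_cols, target_col, window, skip_periods):
        self.date_col = date_col
        self.group_cols = group_cols
        self.target_col = target_col
        self.window = window
        self.skip_periods = skip_periods

    def fit(self, X, y=None):
        # X would need to contain rows for the test dates as well (with the
        # target missing there), which is what makes this awkward in a pipeline
        df = X.sort_values(self.date_col).copy()
        stat_col = f"{self.target_col}_roll_mean_{self.window}"
        df[stat_col] = df.groupby(self.group_cols)[self.target_col].transform(
            lambda s: s.shift(self.skip_periods)
                       .rolling(self.window, min_periods=1)
                       .mean()
        )
        # assumes one row per (group, date) combination
        self.stats_ = df[self.group_cols + [self.date_col, stat_col]]
        return self

    def transform(self, X):
        # the join by date and grouping columns described above
        return X.merge(self.stats_, on=self.group_cols + [self.date_col], how="left")
```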
Answers to your questions:

1- Yes, I will do so from now on. Thank you for the suggestion.
2- I thought this transformer could be used to calculate lags, rolling statistics and target encodings. Although target encodings are widely known in the industry, people may not expect to be able to compute lags and rolling statistics with this transformer, so I agree a better name might be required.
3- I don't think this is a requirement. I just thought that calculating those features would be easier with pandas. However, numpy could be done, I guess.
4- It lags the data before calculating the statistics. For example, if I am training a model that will make Week+2 predictions, then I need to calculate all of these statistics with a gap of 2 weeks to prevent data leakage (see the sketch below).
5- Could you elaborate on the specific conventions used in the RepeatingBasisFunctionTransformer?
6- I believe lags/rolling statistics are very fundamental features that can be used in any problem that has a time dimension. I believe this transformer would work well in combination with TimeGapSplit.
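To illustrate answer 4, a toy example of the 2-week gap (the numbers are made up):

```python
import pandas as pd

sales = pd.Series([10, 12, 11, 13, 15, 14], name="sales")

# Week+2 forecast: shift by 2 so the statistic at week t only uses data
# up to week t-2, then take a 3-week rolling mean
gapped_mean = sales.shift(2).rolling(3, min_periods=1).mean()
# -> NaN, NaN, 10.0, 11.0, 11.0, 12.0
```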
Is it like `CatBoostEncoder` with `has_time=True`?
Edit: `CatBoostEncoder` uses cumulative statistics, instead of lags or window functions.
As far as I know, CatBoost uses cross validation and regularization to encode high cardinality categorical variables while preventing data leakage. The purpose is the same, but here there is no need for cross validation and regularization because we already have a time dimension. We just need to respect the time dimension to prevent data leakage.
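To illustrate the difference with a toy example (made-up numbers, and not CatBoost's actual implementation): an ordered/cumulative encoding uses an expanding mean of everything before each row, while the transformer discussed here would use lagged rolling windows. Both respect the time dimension.

```python
import pandas as pd

y = pd.Series([10, 12, 11, 13, 15, 14])

# cumulative / ordered statistic: expanding mean of all earlier observations
cumulative = y.shift(1).expanding().mean()

# windowed statistic with a gap: fixed 3-step window ending 2 steps before t
windowed = y.shift(2).rolling(3, min_periods=1).mean()

print(pd.DataFrame({"y": y, "cumulative": cumulative, "windowed": windowed}))
```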