Replace `MeanSegmentEncoderTransform` with `MeanEncoderTransform`
🚀 Feature Request
We can add a `MeanEncoderTransform` class that works with any feature. `MeanSegmentEncoderTransform` can then be replaced with `SegmentEncoderTransform` + `MeanEncoderTransform`.
Motivation
It will make the API more flexible and add new functionality to the encoders.
Proposal
- Implement `MeanEncoderTransform`.
- Add deprecation warning on `MeanSegmentEncoderTransform` until etna-3.0.
- Remove `MeanSegmentEncoderTransform` from examples.
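The deprecation step could look roughly like the sketch below. It uses only the standard `warnings` module; the class here is an empty stand-in rather than the real etna class, and the message wording is an assumption.

```python
import warnings


class MeanSegmentEncoderTransform:
    """Empty stand-in for the existing transform, shown only for the warning."""

    def __init__(self):
        # Assumed message text; stacklevel=2 points the warning at the caller.
        warnings.warn(
            "MeanSegmentEncoderTransform is deprecated and will be removed in etna 3.0; "
            "use SegmentEncoderTransform + MeanEncoderTransform instead.",
            DeprecationWarning,
            stacklevel=2,
        )
```

Emitting the warning in `__init__` means users see it once per constructed transform, at the place where they still reference the old class.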
This transform should:
- Order the table by (timestamp, segment) to avoid target leakage.
- Compute the mean encoding of the given feature after ordering.
- Work correctly with NaNs in the feature (use the global running mean).
- Work correctly with new values of the feature (use the global running mean).
- Add additive smoothing like in `CatBoostEncoder`.
- On the prediction phase, use the mean calculated on the whole history.
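The requirements above can be illustrated with a minimal pure-Python sketch. It is not the etna implementation: the function names, the `(timestamp, segment, category, target)` row layout, the use of `None` for a missing feature value, and the choice to encode a timestamp only from strictly earlier timestamps are all assumptions made for illustration. The smoothing follows the additive form `(sum + a * prior) / (count + a)`, with the global running mean as the prior.

```python
import math
from collections import defaultdict
from itertools import groupby


def _smoothed(cat_sum, cat_count, global_mean, smoothing):
    # Additive smoothing: pull rarely seen categories toward the global mean.
    return (cat_sum + smoothing * global_mean) / (cat_count + smoothing)


def fit_transform_mean_encoding(rows, smoothing=1.0):
    """Encode training rows; rows are (timestamp, segment, category, target)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    sums, counts = defaultdict(float), defaultdict(int)
    total_sum, total_count = 0.0, 0
    encoded = []
    for _, batch in groupby(ordered, key=lambda r: r[0]):
        batch = list(batch)
        # Encode the whole timestamp first, using only strictly earlier rows.
        global_mean = total_sum / total_count if total_count else math.nan
        for _, _, category, _ in batch:
            if category is None or counts[category] == 0:
                # Missing or not-yet-seen value: fall back to the global running mean.
                encoded.append(global_mean)
            else:
                encoded.append(_smoothed(sums[category], counts[category], global_mean, smoothing))
        # Only then add this timestamp's targets to the running statistics.
        for _, _, category, target in batch:
            total_sum += target
            total_count += 1
            if category is not None:
                sums[category] += target
                counts[category] += 1
    return encoded, (sums, counts, total_sum, total_count)


def encode_future(category, state, smoothing=1.0):
    """Prediction phase: use statistics from the whole (non-empty) history."""
    sums, counts, total_sum, total_count = state
    global_mean = total_sum / total_count
    if category is None or counts.get(category, 0) == 0:
        return global_mean
    return _smoothed(sums[category], counts[category], global_mean, smoothing)
```

The very first timestamp has no history at all, so this sketch returns NaN there; a real transform would need to decide how to handle that case.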
The transform should support two modes (like `SklearnTransform`):
- macro: calculate the running mean across all segments
- micro (per-segment): calculate the running mean within each segment individually
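The difference between the two modes comes down to which key the running statistics are grouped by. A small sketch, assuming hypothetical mode strings `"macro"` and `"per-segment"` and the same `(timestamp, segment, category, target)` row layout (smoothing and NaN handling are left out to keep the contrast visible):

```python
from collections import defaultdict
from itertools import groupby


def past_means(rows, mode="macro"):
    """Map (timestamp, segment) to the mean of earlier targets for that row's category."""
    if mode not in ("macro", "per-segment"):
        raise ValueError(f"unknown mode: {mode!r}")
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    sums, counts = defaultdict(float), defaultdict(int)
    result = {}

    def key_for(segment, category):
        # macro pools a category over all segments; per-segment keeps them apart.
        return category if mode == "macro" else (segment, category)

    for _, batch in groupby(ordered, key=lambda r: r[0]):
        batch = list(batch)
        for timestamp, segment, category, _ in batch:
            key = key_for(segment, category)
            result[(timestamp, segment)] = sums[key] / counts[key] if counts[key] else None
        for _, segment, category, target in batch:
            key = key_for(segment, category)
            sums[key] += target
            counts[key] += 1
    return result


rows = [(1, "a", "x", 10.0), (1, "b", "x", 30.0), (2, "a", "x", 50.0), (2, "b", "x", 70.0)]
macro = past_means(rows, mode="macro")        # (2, "a") and (2, "b") both see mean 20.0
micro = past_means(rows, mode="per-segment")  # (2, "a") sees 10.0, (2, "b") sees 30.0
```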
Notes:
- There are a few implementation options to avoid leakage:
  - To encode the current row, use all rows whose timestamp is earlier than the current row's timestamp. This option seems more difficult to implement in a vectorized fashion.
  - To encode the current row, use all rows up to and including the current one, then apply lag-1 within each segment after encoding to get rid of the leakage. This option seems easier to implement.
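To check that the two options can agree, here is a small sketch for a single segment whose rows are already ordered by timestamp. It assumes that the lag in the second option is applied to each category's own running mean (for a feature that is constant within a segment, such as a segment code, this coincides with a plain lag-1 in the segment). Function names and the `(category, target)` layout are illustrative only.

```python
from collections import defaultdict


def encode_strict_past(series):
    """Option 1: average the targets of earlier rows that share the category."""
    out = []
    for i, (category, _) in enumerate(series):
        past = [t for c, t in series[:i] if c == category]
        out.append(sum(past) / len(past) if past else None)
    return out


def encode_inclusive_then_lag(series):
    """Option 2: inclusive running mean per category, then take the previous value."""
    sums, counts = defaultdict(float), defaultdict(int)
    last_inclusive = {}  # most recent inclusive mean seen for each category
    out = []
    for category, target in series:
        # The lag: read the value stored before the current row is added.
        out.append(last_inclusive.get(category))
        sums[category] += target
        counts[category] += 1
        last_inclusive[category] = sums[category] / counts[category]
    return out


series = [("x", 10.0), ("y", 20.0), ("x", 30.0), ("x", 50.0), ("y", 40.0)]
option_1 = encode_strict_past(series)
option_2 = encode_inclusive_then_lag(series)
```

Both calls return `[None, None, 10.0, 20.0, 20.0]` for this series. The first version rescans the history for every row, while the second is a single pass over cumulative sums and counts, which is the part that maps naturally onto grouped cumulative operations.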
Test cases
- Test that it works fine if there are no NaNs or unknown values on train
- Test that it works fine if there are no NaNs or unknown values on test
- Test that it works fine if there are NaNs on train/test
- Test that it works fine if there are unknown values on test
- Make sure that the combination of `SegmentEncoderTransform` + `MeanEncoderTransform` works fine.
Alternatives
No response
Additional context
No response
Checklist
- [X] I discussed this issue with ETNA Team