Preprocessing - TargetEncoder is dangerous
Congrats on the release of this package! I thought I'd contribute back a little with this issue.
The TargetEncoder strikes me as a dangerous transformation. While the docstring does openly say that it suffers from leakage, it gives the impression that this isn't a problem if you apply regularisation or cross-validation. I find that somewhat misleading and think the encoder is probably best avoided altogether.
To illustrate the danger: imagine you have a dataset with only a single data point x and corresponding label y. The TargetEncoder will then encode x as exactly the label y, even with regularisation applied! The underlying issue is that each example x's own target value y is used to encode x, and that remains true no matter how many examples you add.
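To make this concrete, here is a minimal sketch in plain pandas (the smoothing formula below is my own assumption of a typical prior-based regularisation, not necessarily cobra's exact implementation). With a single example, the global prior the encoding is shrunk towards is itself equal to y, so the encoding collapses to y no matter how strong the smoothing is:

```python
import pandas as pd

# A toy dataset with a single example: category "a", label 1.0.
df = pd.DataFrame({"cat": ["a"], "y": [1.0]})

prior = df["y"].mean()                         # global mean = 1.0 (the only label)
stats = df.groupby("cat")["y"].agg(["mean", "count"])

# Smoothed ("regularised") target encoding: blend the category mean with the prior.
smoothing = 10.0                               # even heavy smoothing does not help
weight = stats["count"] / (stats["count"] + smoothing)
encoding = weight * stats["mean"] + (1 - weight) * prior

print(encoding)                                # category "a" -> 1.0, the example's own label
```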
Let's say you want to deal with that issue by implementing a "LeaveOneOutTargetEncoder", which replaces each example's categorical value with the average target of the other examples that share the same categorical value (see e.g. [1]). That sounds a bit better because none of the examples are allowed to use their own target value to encode their features. But even this encoder suffers from leakage! To see this, imagine that the encoder encodes a category as the leave-one-out sum (instead of the average). The model could then learn the per-category target sums, and simply subtract an example x's leave-one-out sum from the per-category sum to predict the exact label y for the example x.
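A small sketch of that attack on toy data (the leave-one-out sum encoding here is hypothetical, purely for illustration; none of this is cobra code):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                   "y":   [1.0, 2.0, 4.0, 3.0, 5.0]})

# Hypothetical leave-one-out *sum* encoding: each row gets the sum of the
# other rows' targets within its own category.
cat_sum = df.groupby("cat")["y"].transform("sum")
loo_sum = cat_sum - df["y"]

# A model that has memorised the per-category sums can recover every label exactly:
recovered = cat_sum - loo_sum
print((recovered == df["y"]).all())            # True -- perfect leakage
```

Note that the same recovery works with the leave-one-out mean, since y = category_sum - (n_category - 1) * loo_mean.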
In general, any transformation that "inserts y into X" should be treated with a lot of scrutiny.
[1] https://contrib.scikit-learn.org/category_encoders/leaveoneout.html