
Preprocessing - TargetEncoder is dangerous

lsorber opened this issue 4 years ago

Congrats on the release of this package! I thought I'd contribute back a little with this issue.

The TargetEncoder strikes me as a dangerous transformation. While the docstring does openly say that it suffers from leakage, it gives the impression that this isn't a problem if you apply regularisation or cross-validation. I find that somewhat misleading and think the encoder is probably best avoided in general.

To illustrate the danger: imagine you have a dataset with only one data point x and corresponding label y. It's clear that the TargetEncoder will then encode x as the exact label y, even when applying regularisation! The issue is that each example x's own target value y is used to encode x, and that remains true as you increase the number of examples.
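To make that concrete, here is a minimal, self-contained sketch in plain pandas (purely illustrative, not cobra's implementation) showing that with a single data point even M-estimate smoothing returns the exact label, because the global mean is that label:

import pandas as pd

# Toy dataset with a single example: category "a" with label 1.
X = pd.Series(["a"], name="cat")
y = pd.Series([1.0], name="target")

# Plain target (mean) encoding reproduces the label exactly.
print(y.groupby(X).mean()["a"])  # 1.0

# M-estimate smoothing towards the global mean does not help here,
# because with one data point the global mean *is* the label.
weight = 10.0
stats = y.groupby(X).agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + weight * y.mean()) / (stats["count"] + weight)
print(smoothed["a"])  # still 1.0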

Let's say you want to deal with that issue by implementing a "LeaveOneOutTargetEncoder", which replaces each example's categorical value with the average target of the other examples that share the same categorical value (see e.g. [1]). That sounds a bit better because none of the examples are allowed to use their own target value to encode their features. But even this encoder suffers from leakage! To see this, imagine that the encoder encodes a category as the leave-one-out sum (instead of the average). The model could then learn the per-category target sums, and simply subtract an example x's leave-one-out sum from the per-category sum to predict the exact label y for the example x.
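A small self-contained sketch of that leave-one-out-sum argument (plain pandas, purely illustrative):

import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                   "y":   [1.0, 2.0, 4.0, 3.0, 5.0]})

# Leave-one-out *sum* encoding: for each row, the summed target of the
# other rows in the same category.
cat_sum = df.groupby("cat")["y"].transform("sum")
loo_sum = cat_sum - df["y"]

# A model that has learned the per-category sums can recover every
# row's exact label from this "feature".
recovered = cat_sum - loo_sum
print((recovered == df["y"]).all())  # True -- perfect leakage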

In general, any transformation that "inserts y into X" should be treated with a lot of scrutiny.

[1] https://contrib.scikit-learn.org/category_encoders/leaveoneout.html

lsorber avatar Jan 16 '21 00:01 lsorber

Hi Laurent, thanks for taking the time to look at the package. We understand your concerns and agree that target encoding should be handled with care.

To the best of our knowledge, the following points make our implementation safer:

  • First of all, the data is split into three buckets: train, validation and test. The target encoding is fitted only on the train set and then applied to the validation data (for validating the model) and the test data (for a final test), using the values learned from the training data! This means we can be confident that the model is not overfitting; in fact, we noticed that this approach handles overfitting better than other approaches. A minimal sketch of this split-first workflow follows the code snippet below.
  • Secondly, before the target replacement takes place, categories of high-cardinality variables are grouped into one, avoiding the scenario where a single category value would be replaced with the exact target value (to be precise, we run a chi-squared test between the category of interest and all other categories, and if the result is not significantly different, we move it into the "Other" category).
  • Lastly, our implementation actually includes an M-estimate method from the link you shared, which serves as another protection against leakage:
# Excerpt from the encoder's fit step: M-estimate smoothing of the
# per-category target mean towards the global mean.
stats = y.groupby(X).agg(["mean", "count"])  # per-category mean and count

# Note: if self.weight == 0, we have the ordinary incidence replacement
numerator = (stats["count"] * stats["mean"]
             + self.weight * self._global_mean)

denominator = stats["count"] + self.weight

return numerator / denominator
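Coming back to the first bullet, a minimal sketch of the split-first, fit-on-train-only workflow could look as follows (purely illustrative: it uses category_encoders' TargetEncoder as a stand-in rather than cobra's own encoder, and the data is made up):

import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder  # stand-in for cobra's encoder

df = pd.DataFrame({"cat": list("aabbbccccc"),
                   "y":   [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]})

# Split first, so the validation and test targets can never influence the encoding.
train, valtest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(valtest, test_size=0.5, random_state=42)

encoder = TargetEncoder(cols=["cat"])
encoder.fit(train[["cat"]], train["y"])     # fitted on the training bucket only

# The training incidences are reused as-is on the other two buckets.
train_enc = encoder.transform(train[["cat"]])
val_enc = encoder.transform(val[["cat"]])
test_enc = encoder.transform(test[["cat"]])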

We get this question a lot when introducing this methodology, so a more detailed explanation in the documentation is something we will add as soon as possible.

Is everything clear?

Jan

JanBenisek avatar Jan 21 '21 10:01 JanBenisek

Dear Laurent,

Thank you for your suggestion indeed - it helped us realise that we should spend more attention on explaining how we tackle what we call 'incidence replacement', and also on the reasons why. I fully agree that, when this step is implemented incorrectly (e.g. when incidences are calculated across train, validation and test), it can lead to false optimism about the algorithm's performance. When implemented correctly, however, it has many advantages and actually offers the key to the hyper-interpretability of the algorithm: each feature can be represented and visualised by its incidence. Additionally, the combination of discretising features and replacing them by incidence efficiently handles the typical challenges in data preprocessing, such as outlier replacement, missing-value imputation and the treatment of categorical variables. So, in short, it is a vital component of the code, and your comment underlines that our approach indeed deserves a more elaborate explanation and motivation. Thanks for pointing this out!

Best, Geert

pythongeert avatar Jan 21 '21 11:01 pythongeert

Thanks for the responses @JanBenisek @pythongeert. I think there are two distinct sources of danger under discussion in this issue:

  1. TargetEncoder can be considered dangerous because it may give a false impression of good model performance if used incorrectly. As @JanBenisek replied, this can be avoided by splitting into train and test before applying the encoder. However, I do still think there's some danger left, because a user who is not aware of this may incorrectly apply the encoder before splitting into train and test. That could be resolved by updating the encoder's documentation, which I think is what @pythongeert is proposing.
  2. Even with (1) addressed, TargetEncoder can still lead to a large difference between training and test set performance. In this regard, @JanBenisek replied that binning rare categories and regularising the estimator help. While I agree with that, there are additional steps you can take to reduce overfitting with this estimator. Specifically, you could instead apply a leave-one-out (LOO) encoder, or better yet, a CatBoost encoder (a minimal sketch of both follows this list). The CatBoost paper [1] compares all three approaches: target encoding, LOO encoding, and CatBoost encoding. Its conclusions are that target encoding is significantly outperformed by both LOO and CatBoost encoding, and that CatBoost encoding can in some cases still provide a significant benefit over LOO.
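Both alternatives are available as scikit-learn-style transformers in the category_encoders package linked earlier; a minimal sketch on toy data (default parameters, purely illustrative):

import pandas as pd
from category_encoders import LeaveOneOutEncoder, CatBoostEncoder

df = pd.DataFrame({"cat": list("aabbbccccc"),
                   "y":   [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]})
X, y = df[["cat"]], df["y"]

# Leave-one-out: during fit_transform, each row is encoded with the mean
# target of the *other* rows in its category; on new data, the full
# per-category mean is used.
X_loo = LeaveOneOutEncoder(cols=["cat"]).fit_transform(X, y)

# CatBoost-style (ordered) target statistics: each row is encoded using only
# the targets of rows seen before it (so the data should be randomly ordered),
# which further reduces leakage.
X_cbe = CatBoostEncoder(cols=["cat"]).fit_transform(X, y)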

Based on that, my recommendation would be to update TargetEncoder's documentation to address (1), and to upgrade TargetEncoder to an LOO- or CatBoost-style encoder to address (2).

[1] https://arxiv.org/pdf/1706.09516.pdf

lsorber avatar Feb 13 '21 15:02 lsorber

Hi, I think an LOO or CatBoost encoder would be a nice enhancement to the current implementation, and I see the benefits. I am adding it to the list for the next release; thanks again for taking the time.

JanBenisek avatar Feb 17 '21 10:02 JanBenisek

New developments will be tracked in https://github.com/PythonPredictions/cobra/issues/61.

sandervh14 avatar Mar 09 '23 13:03 sandervh14