
Preprocessing - improve additive smoothing in target encoder

Open sandervh14 opened this issue 4 years ago • 2 comments

Task Description

Additive smoothing in the target encoder guards against overfitting, but it is disabled by default because the weight parameter defaults to 0 (weight=0). Improve the approach; see the comments at https://github.com/PythonPredictions/cobra/issues/67#issuecomment-894122194 and https://github.com/PythonPredictions/cobra/issues/67#issuecomment-894077255.
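For reference, a minimal sketch of what additive smoothing does to a target encoding, assuming the common formula (n * category_mean + weight * global_mean) / (n + weight). The function name and the toy data are hypothetical, and Cobra's actual implementation may differ in its details.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, weight=0.0):
    """Return {category: smoothed mean of the target}.

    Assumed formula: (n * category_mean + weight * global_mean) / (n + weight).
    With weight=0 this reduces to the plain per-category mean (no smoothing).
    """
    if weight < 0:
        raise ValueError("weight must be non-negative")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y  # sums[cat] equals n * category_mean
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)
    return {
        cat: (sums[cat] + weight * global_mean) / (counts[cat] + weight)
        for cat in counts
    }


# Toy data: category "b" is seen only once, so its raw mean is unreliable.
categories = ["a", "a", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]

unsmoothed = smoothed_target_encoding(categories, targets, weight=0)
smoothed = smoothed_target_encoding(categories, targets, weight=4)

print(unsmoothed)  # "b" is encoded as its single observed target value
print(smoothed)    # "b" is pulled toward the global mean
```

With weight=0 the rare category is encoded purely from its one observation, which is the overfitting risk described above; a positive weight shrinks it toward the global mean.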

sandervh14 avatar Aug 06 '21 14:08 sandervh14

We went for a regular count in the additive smoothing simply because it is the standard way to do it (as introduced in the CatBoost paper and in https://github.com/scikit-learn-contrib/category_encoders). Keeping it that way makes the implementation more familiar to other users in the field. Typically, when you want a non-zero smoothing parameter, you treat it as a hyperparameter and tune it with the usual hyperparameter tuning techniques.
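To illustrate treating the weight as a hyperparameter, here is a rough sketch that picks it from a small grid using a hold-out set. Everything in it (the helper names, the candidate grid, the toy data, the squared-error criterion) is an assumption made for illustration, not Cobra's API; in practice you would probably tune it together with the downstream model.

```python
from collections import defaultdict


def fit_encoding(categories, targets, weight):
    """Fit a smoothed target encoding; returns (mapping, global_mean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {
        cat: (sums[cat] + weight * global_mean) / (counts[cat] + weight)
        for cat in counts
    }
    return mapping, global_mean


def holdout_error(mapping, global_mean, categories, targets):
    """Mean squared error of the encoded value against the hold-out target."""
    errors = [
        # Categories unseen during fitting fall back to the global mean.
        (mapping.get(cat, global_mean) - y) ** 2
        for cat, y in zip(categories, targets)
    ]
    return sum(errors) / len(errors)


# Toy data: "c" occurs once in training, so its raw mean generalizes poorly.
train_cats = ["a"] * 4 + ["b"] * 4 + ["c"]
train_y = [1, 0, 0, 0, 1, 1, 0, 1, 1]
valid_cats = ["a", "a", "b", "b", "c", "c"]
valid_y = [0, 0, 1, 1, 0, 0]

candidate_weights = [0.0, 1.0, 5.0, 20.0]  # hypothetical search grid
scores = {}
for w in candidate_weights:
    mapping, global_mean = fit_encoding(train_cats, train_y, w)
    scores[w] = holdout_error(mapping, global_mean, valid_cats, valid_y)

best_weight = min(scores, key=scores.get)
print(scores)
print("best weight:", best_weight)
```

On this toy data a non-zero weight beats weight=0 on the hold-out set, because the rare category is no longer encoded from a single observation.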

MatthiasRoels avatar Aug 06 '21 15:08 MatthiasRoels

To be checked whether all points are now covered, as mentioned in:

Improve the approach, see comments https://github.com/PythonPredictions/cobra/issues/67#issuecomment-894122194 and https://github.com/PythonPredictions/cobra/issues/67#issuecomment-894077255.

This should also be checked against Matthias's comment above. Hyperparameter support was extended in https://github.com/PythonPredictions/cobra/issues/129; to be checked whether the target encoder's additive smoothing is now flexibly supported as an option, with the default option as Matthias suggested.

Above all, a warning is needed about the default weight=0, which carries a risk of overfitting and is therefore always our responsibility as Cobra-using data scientists. Moving away from the default weight of 0 could even be considered. Also keep in mind that some users do think about this responsibility and warn that not everyone will: https://github.com/PythonPredictions/cobra/issues/24. To be discussed whether the warning alone is enough.
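One possible shape for such a warning, sketched with Python's standard warnings module. The helper name check_smoothing_weight and the message text are hypothetical, and nothing here reflects what Cobra currently does.

```python
import warnings


def check_smoothing_weight(weight):
    """Validate a smoothing weight; warn when smoothing is switched off.

    Hypothetical helper: an encoder could call this from its constructor.
    """
    if weight < 0:
        raise ValueError("weight must be non-negative")
    if weight == 0:
        warnings.warn(
            "weight=0 disables additive smoothing; categories with few "
            "observations may be overfitted. Consider tuning weight.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    return weight


# Usage: record warnings to show that only weight=0 triggers one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_smoothing_weight(0.0)   # warns
    check_smoothing_weight(10.0)  # silent

print(len(caught), "warning(s) raised")
```

A warning keeps the existing default and stays backward compatible, whereas changing the default to a positive weight would change the encodings existing users get.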

sandervh14 avatar Mar 09 '23 13:03 sandervh14