feature_engine
Smoothing on Mean Encoder
Is your feature request related to a problem? Please describe. Hi Sole, I'm currently doing the feature engineering mini-course on Kaggle. The last section is about mean encoding and how it risks overfitting, especially with unknown or rare categories. They mention that smoothing is a way to deal with this. I checked the code but I don't see anything like that. So should MeanEncoder have a smoothing parameter?
I got excited too soon: category_encoders has an MEstimateEncoder to manage this. But I still believe MeanEncoder should have this.
Yes, some users also requested to introduce this transformer or smoothing function in this issue: #160
I am still wondering whether it is worth bringing into Feature-engine functionality that already exists in category_encoders. Category encoders also allows you to select columns, so I am not sure how much value it would add.
What are your thoughts?
As someone mentioned in the issue, there is no need to reinvent the wheel. Maybe an update to the docstring suggesting category_encoders for smoothing?
yes, good idea. Would you like to add that to the docs?
Otherwise, I could do that later as part of #314
Yeah, I will work on this over the weekend.
Great! In that case, please make a PR to the branch "pydata_template" as the doc files have been reorganized quite a bit compared to master.
And you probably want to modify this file and / or this file.
Already added links to category encoders in the class api docs: https://github.com/feature-engine/feature_engine/pull/314/commits/eedfc5ea279d13d3a957f7bc0e3e30378a6e7d92
We still need to add some details in the user guide.
I've been extremely busy, Sole, but I will try to do it this weekend.
added links to docs in #314
Reopening as I think it might be worth bringing in this functionality to our class as well. It is not so complicated.
hola @solegalli,
I can work on this task. Although, I'm always wary when you write, "It is not so complicated." ;)
Go for it.
We need to implement the transformation described in the article cited in this documentation and I guess we can borrow the logic to calculate the prior and posterior probability from category encoders as well.
What I am still not sure about is whether we should adapt our MeanEncoder to apply the smoothing or create a completely new class. The reason is that the MeanEncoder is used by other transformers, although these other transformers could also benefit from the smoothing. We need to brainstorm a bit.
If we want to distinguish our mean encoder from the many others (category_encoders, h2o, dirty_cat, etc.), we could use a different smoothing function. The original Micci-Barreca paper uses a sigmoid function to calculate the smoothing factors, but it is possible to use any other monotonic function with values in [0, 1] (they also say so in the paper). For example, here the authors suggest using x / (x + a) (p. 23, formula 2.14). Another option is formula 8 in the original Micci-Barreca paper (A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems). It will be similar to the M-estimator, though.
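To make the comparison concrete, here is a minimal sketch of the two weighting functions mentioned above. The function names, the parameter names `k` and `f`, and their default values are illustrative assumptions, not anything taken from Feature-engine or category_encoders.

```python
import math


def sigmoid_weight(n, k=20, f=10):
    """Sigmoid-style weight: 1 / (1 + exp(-(n - k) / f)).

    n is the category count. k (midpoint) and f (slope) are
    hypothetical names and defaults chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(n - k) / f))


def ratio_weight(n, a):
    """x / (x + a) weight, with n as the category count and a >= 0."""
    if n == 0 and a == 0:
        return 0.0  # unseen category and no smoothing: no evidence at all
    return n / (n + a)


# Both weights grow monotonically from 0 towards 1 as the count grows.
for n in (1, 10, 100):
    print(n, round(sigmoid_weight(n), 3), round(ratio_weight(n, a=10), 3))
```

Either weight can then blend the category mean with the global mean; the ratio form has a single parameter, which is part of why it resembles the M-estimator.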
thank you @GLevV
Ok, I'm getting back to this issue. I'll get started on it this coming week.
@GLevV, a few questions:
- Are you referring to using x / (x + a) for the weighting?
- Does x represent the total number of times that category occurs in the data?
- Does m represent the "smoothing factor" that is to be selected by the user?
x is the count/frequency of each category, and a is the smoothing factor. If a is 0, then this becomes the vanilla MeanEncoder that we have right now (which is good for backward compatibility and for classes relying on MeanEncoder).
For more math details please refer to the papers mentioned above.
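As a rough illustration of the above, here is a sketch in pandas that blends each category mean with the global mean using the weight x / (x + a). The function `smoothed_mean_encoding` and the toy data are hypothetical and only meant to show that a = 0 reproduces the plain category means; this is not the MeanEncoder implementation.

```python
import pandas as pd


def smoothed_mean_encoding(categories, target, a=0.0):
    """Return one encoded value per category.

    encoded = w * category_mean + (1 - w) * global_mean, with w = n / (n + a),
    where n is the category count and a is the smoothing factor.
    """
    stats = target.groupby(categories).agg(["mean", "count"])
    global_mean = target.mean()
    weight = stats["count"] / (stats["count"] + a)
    return weight * stats["mean"] + (1 - weight) * global_mean


# Toy data (made up): "B" is a rare category with a single observation.
df = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B"],
    "y": [1, 0, 1, 1, 0],
})

plain = smoothed_mean_encoding(df["city"], df["y"], a=0)     # vanilla means
smoothed = smoothed_mean_encoding(df["city"], df["y"], a=2)  # pulled to global mean
print(plain)
print(smoothed)
```

With a > 0 the rare category moves further towards the global mean than the frequent one, which is the overfitting protection discussed in this issue.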
As a footnote, the Micci-Barreca paper mentions that a for this particular formula could easily be calculated from the data, like this: var_y_category / var_y. We can try to add it as an auto option.
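A possible reading of that auto option, sketched with the standard library only: compute one smoothing factor per category as the variance of the target within the category divided by the overall target variance. The function name, the use of population variance, and the toy data are assumptions made for illustration; the paper should be checked for the exact estimator.

```python
from collections import defaultdict
from statistics import pvariance


def auto_smoothing_factors(categories, target):
    """Per-category a = var_y_category / var_y (assumed reading of the comment).

    Population variance is used so that single-observation categories
    do not raise an error; they simply get a = 0.
    """
    groups = defaultdict(list)
    for category, y in zip(categories, target):
        groups[category].append(y)
    var_y = pvariance(target)
    if var_y == 0:
        return {category: 0.0 for category in groups}  # constant target
    return {category: pvariance(ys) / var_y for category, ys in groups.items()}


# Toy data (made up for illustration).
cats = ["A", "A", "A", "A", "B", "B"]
y = [1, 0, 1, 1, 0, 0]

factors = auto_smoothing_factors(cats, y)
counts = {c: cats.count(c) for c in factors}
weights = {c: counts[c] / (counts[c] + factors[c]) for c in factors}
print(factors)
print(weights)
```

One caveat of this sketch: a category with a constant target gets a = 0 and therefore no smoothing at all, so a real implementation would probably need a floor or a fallback for very small categories.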
Ok, cool. Thanks for the explanation. We're saying the same thing.
@solegalli, should we knock out SelectByInformationValue and TargetMeanSelector before jumping into this one? I see it's a priority.
Is it in the works or stale?
It is waiting for a kind contributor to pick it up :)