
Smoothing on Mean Encoder

Open hectorpatino opened this issue 3 years ago • 14 comments

Is your feature request related to a problem? Please describe. Hi Sole, I'm currently doing the mini course on feature engineering on Kaggle. The last section is about mean encoding and how it can create a risk of overfitting, especially with unknown or rare categories. They mention that smoothing is a way to deal with this. I checked the code but I don't see anything like that. So, should MeanEncoder have a smoothing parameter?

hectorpatino avatar Oct 15 '21 21:10 hectorpatino

I got excited too soon: category_encoders has an MEstimateEncoder to handle this. But I still believe MeanEncoder should have it.

hectorpatino avatar Oct 15 '21 21:10 hectorpatino

Yes, some users also requested to introduce this transformer or smoothing function in this issue: #160

I am still wondering if it is worth bringing into Feature-engine functionality that already exists in category_encoders. category_encoders also lets you select columns, so I'm not sure how much value it would add.

What are your thoughts?

solegalli avatar Oct 16 '21 09:10 solegalli

As someone mentioned in that issue, there is no need to reinvent the wheel. Maybe an update to the docstring suggesting category_encoders for smoothing?

hectorpatino avatar Oct 16 '21 19:10 hectorpatino

yes, good idea. Would you like to add that to the docs?

Otherwise, I could do that later as part of #314

solegalli avatar Oct 18 '21 08:10 solegalli

Yeah, I will work on this over the weekend.

hectorpatino avatar Oct 20 '21 15:10 hectorpatino

Great! In that case, please make a PR to the branch "pydata_template" as the doc files have been reorganized quite a bit compared to master.

And you probably want to modify this file and / or this file.

solegalli avatar Oct 21 '21 08:10 solegalli

Already added links to category encoders in the class api docs: https://github.com/feature-engine/feature_engine/pull/314/commits/eedfc5ea279d13d3a957f7bc0e3e30378a6e7d92

We still need to add some details in the user guide.

solegalli avatar Nov 08 '21 20:11 solegalli

I've been extremely busy, Sole, but I will try to do it this weekend.

hectorpatino avatar Nov 08 '21 22:11 hectorpatino

added links to docs in #314

solegalli avatar Nov 16 '21 16:11 solegalli

Reopening, as I think it might be worth bringing this functionality into our class as well. It is not so complicated.

solegalli avatar Apr 25 '22 17:04 solegalli

Hello @solegalli,

I can work on this task. Although, I'm always wary when you write, "It is not so complicated." ;)

Morgan-Sell avatar May 09 '22 23:05 Morgan-Sell

Go for it.

We need to implement the transformation described in the article cited in this documentation and I guess we can borrow the logic to calculate the prior and posterior probability from category encoders as well.

What I am still not sure about is whether we should adapt our MeanEncoder to apply the smoothing, or create a completely new class. The reason is that the MeanEncoder is used by other transformers, although those other transformers could also benefit from the smoothing. We need to brainstorm a bit.

solegalli avatar May 10 '22 06:05 solegalli

If we want to distinguish our mean encoder from the many others (category_encoders, h2o, dirty_cat, etc.), we could use a different smoothing function. In the original Micci-Barreca paper they use a sigmoid function to calculate the smoothing factors, but it is possible to use any other monotonic (0-1) function (they also say so in the paper). For example, here the authors suggest using x / (x + a) (p. 23, formula 2.14). Or formula 8 in the original Micci-Barreca paper ("A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems"). It would be similar to an M-estimator, though.

glevv avatar Aug 06 '22 04:08 glevv
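The x / (x + a) smoothing described above can be sketched in a few lines of pandas. This is an illustrative sketch, not Feature-engine's implementation; the function name `smoothed_mean_encode` and the toy data are invented for the example.

```python
import pandas as pd

def smoothed_mean_encode(X, y, variable, a=1.0):
    """Mean-encode `variable`, shrinking each category toward the
    global target mean with weight n / (n + a), where n is the
    category count and `a` the smoothing factor. With a=0 this
    reduces to the plain mean encoding."""
    global_mean = y.mean()
    stats = y.groupby(X[variable]).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + a)
    mapping = weight * stats["mean"] + (1.0 - weight) * global_mean
    return X[variable].map(mapping)

# Toy data: category "c" appears only once, so its encoding is
# pulled strongly toward the global mean instead of its raw mean of 1.0.
df = pd.DataFrame({"city": ["a", "a", "a", "b", "b", "c"]})
target = pd.Series([1, 1, 0, 1, 0, 1])
encoded = smoothed_mean_encode(df, target, "city", a=1.0)
```

Note how the rare category gets a count-dependent weight: the fewer observations a category has, the closer its encoding sits to the global mean, which is exactly the overfitting protection discussed in this thread.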

thank you @GLevV

solegalli avatar Aug 06 '22 13:08 solegalli

Ok, I'm getting back to this issue. I'll get started on it this coming week.

@GLevV, a few questions:

  1. Are you referring to using x / (x + a) for the weighting?
  2. Does x represent the total number of times that category occurs in the data?
  3. Does m represent the "smoothing factor" that is to be selected by the user?

Morgan-Sell avatar Aug 18 '22 22:08 Morgan-Sell

x is the count/frequency of a category, and a is the smoothing factor. If a is 0, this becomes the vanilla MeanEncoder we have right now (which is good for backward compatibility and for the classes relying on MeanEncoder). For more mathematical detail, please refer to the papers mentioned above.

In a footnote, the Micci-Barreca paper mentions that a for this particular formula can be easily calculated from the data, as var_y_category / var_y, so we could try to add it as an "auto" option.

glevv avatar Aug 19 '22 09:08 glevv
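One possible reading of the "auto" option mentioned above is to estimate a as the ratio of the average within-category target variance to the overall target variance. This is a sketch under that assumed interpretation of var_y_category / var_y; the function name `auto_smoothing_factor` is invented for illustration.

```python
import pandas as pd

def auto_smoothing_factor(X, y, variable):
    """Estimate the smoothing factor `a` from the data as
    var_y_category / var_y: the mean within-category target variance
    divided by the overall target variance (assumed reading of the
    Micci-Barreca footnote). Single-member categories have undefined
    sample variance and are skipped when averaging."""
    within_var = y.groupby(X[variable]).var().mean()
    return within_var / y.var()

df = pd.DataFrame({"city": ["a", "a", "a", "b", "b", "c"]})
target = pd.Series([1, 1, 0, 1, 0, 1])
a = auto_smoothing_factor(df, target, "city")
```

A large ratio means the target is noisy within categories relative to its overall spread, so more shrinkage toward the global mean is applied; a small ratio means the category means are informative and little smoothing is needed.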

Ok, cool. Thanks for the explanation. We're saying the same thing.

@solegalli, should we knock out SelectByInformationValue and TargetMeanSelector before jumping into this one? I see it's a priority.

Morgan-Sell avatar Aug 20 '22 16:08 Morgan-Sell

Is this in the works or stale?

glevv avatar Sep 10 '22 10:09 glevv

It is waiting for a kind contributor to pick it up :)

solegalli avatar Sep 10 '22 11:09 solegalli