feature_engine TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value

TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value

Open Morgan-Sell opened this issue 2 years ago • 19 comments

Closes #394.

The transformer accepts a dictionary that defines how numeric variables will be discretized/organized into bins. The transformer calculates and returns the average for the respective bins.

Apr 16 '22 01:04 Morgan-Sell

@solegalli,

I was initially thinking that this class would be similar to ArbitraryDiscretiser where the user passes the desired bins. However, I'm now realizing we may want to incorporate the EqualWidthDiscretiser and EqualFrequencyDiscretiser. What are your thoughts?

Apr 16 '22 17:04 Morgan-Sell

hi @solegalli,

With regards to implementation, are you envisioning the following order of operations or something of the sort?

var_A is a numeric variable in dataframe X.
fit() copies the original values of var_A.
A discretizer, e.g. EqualFrequencyDiscretiser, discretizes and assigns a bin number to the respective values. Let's call this variable var_A_disc.
MeanEncoder.fit() accepts var_A_disc as X and var_A as y.

In the case of multiple variables, the code would iterate through the numeric variables while performing the abovementioned operations.

Apr 17 '22 03:04 Morgan-Sell

hi @solegalli,

I'm developing the tests for TargetMeanDiscretiser(). I see that the team has made substantial edits to the unit tests. I'm using the EqualFrequencyDiscretiser and EqualWidthDiscretiser tests as guides.

Both classes have the following test:

X_t = [x for x in range(0, 10)]
assert all(x for x in X["var"].unique() if x not in X_t)

This seems to check whether the transformed variable X["var"] possesses an integer in the range of 0 to 9. What's the test's rationale?

Gracias!

Apr 20 '22 00:04 Morgan-Sell

@solegalli,

I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g. @pytest.mark.parameterize. At the same time, I see that the team has made significant edits to the tests directory. Which unit tests should I use as a guide?

Gracias!

Apr 21 '22 04:04 Morgan-Sell

which line of code are you referring to?

May 07 '22 07:05 solegalli

@solegalli,

I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g. @pytest.mark.parameterize. At the same time, I see that the team has made significant edits to the tests directory. Which unit tests should I use as a guide?

Gracias!

I would say, have a look at the tests for the TargetMeanRegressor and TargetMeanClassifier.

May 07 '22 07:05 solegalli

@solegalli,

Soy un boludo! I misinterpreted the instructions, which are in the "TargetMeanDiscretiser". Consequently, I created the _encode_X() and the rest of the boludese (castellano cientifico).

Let me know about adding the ArbitraryDiscretiser(). I'll update the test pronto!

May 09 '22 01:05 Morgan-Sell

hola @solegalli,

I worked on the documentation (still need to review). I've added the class to multiple index.rst files; however, it seems that I'm still missing the table of contents, i.e., toctree. Where is it?

Also, the new testing methodology is really cool! I'm still unclear exactly how it operates though. Maybe we can discuss it sometime?

Abrazo!

May 11 '22 01:05 Morgan-Sell

rked on the documentation (still nee

You need to add the transformer to this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/index.rst

I am off on hols since Friday, maybe we can do it on my return?

Cheers

May 11 '22 07:05 solegalli

One of the common tests is failing with this error:

E AttributeError: 'TargetMeanDiscretiser' object has no attribute 'variables_'

May 11 '22 07:05 solegalli

Gracias, @solegalli!

Enjoy your vacay! And, yes, it would be great to review the tests when you return ;)

May 11 '22 23:05 Morgan-Sell

@Morgan-Sell

Did I completely forget about this PR?

Looks like it is more or less good to go?

Jul 05 '22 13:07 solegalli

Ok, I went through this quickly, the docstrings still need to be added.

The idea of this transformer was to perform discretization, and then replace the values by the mean of the target

Now this functionality could be achieved already by combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline.

So I am having second thoughts as to whether we should create this class, when it is possible to obtain the same result with the transformers that we already have.

Would you mind if we keep this PR on hold?

Apologies :/

Jul 05 '22 14:07 solegalli

hi @solegalli,

I'll pause on editing the docstrings.

How did you envision this class to be different than combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline? I thought algorithmically this was the purpose.

I guess one way to think of this class as a tool in the feature-engine toolbox that a user may not be aware of until reading the feature-engine docs, assuming people read the documentation in depth ;)

Jul 06 '22 22:07 Morgan-Sell

Hi @Morgan-Sell

There is a paper in the KDD 2009 data science competition, where the authors discretize all numerical variables, and then replace their values by the target mean. And they use this as predictions to basically select features.

For categorical variables, they replace the categories by the target mean. And they proceed to select features in the same way.

So the MeanEncoder and the TargetMeanDiscretizer aim to do the same thing: replace the categories by the target mean and sort the variables into intervals and then replace the intevals by the target mean.

This is where my thinking came from.

Most users would not be aware of this possibility, because well, they don't know all the literature and probably don't have time to catch up. So having the class ready, is helpful, in the sense that it makes the transformation "visible".

On the other hand, it is more work for us, given that the encoding could be done already with the classes that already exist. So yeah, I am in 2 minds.

What do you think?

Aug 03 '22 07:08 solegalli

I made a PR with some changes here:

https://github.com/Morgan-Sell/feature_engine/pull/11

only 2 things remain doing and then we are good to go!

Aug 03 '22 09:08 solegalli

@Morgan-Sell this in reply to whether or not we should move forward. I am still in 2 minds. What's your view?

Aug 19 '22 08:08 solegalli

I'm leaning towards creating the transformer.

I see Python packages as toolboxes and like to consider the novice craftsman - e.g. me - when designing the tools. Like you said, people are going to open the toolbox and be amzed by the collection of our wonderful tools. It's unlikely for a novice to make the connection between the numerical discretizer and mean encoder.

Also, the use may be unaware of sklearn's Pipeline class.

I think we've already done most of the work.

Once created, how much maintenance is required for each class? I'm assuming it's not much but could depend on the class/transformer's complexity. I guess this class will need to be updated whenever changes are made to the numerical discretizers or mean encoder.

I'm still leaning towards creating the class for the newbies as I'm one of them ;)

Aug 19 '22 22:08 Morgan-Sell

Sounds good, Let's finish the IV selector, and then we come back to this one. In the meantime, we sleep on the idea a bit longer :p

Aug 22 '22 09:08 solegalli

feature_engine feature_engine copied to clipboard

TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value

feature_engine
feature_engine copied to clipboard