feature_engine
feature_engine copied to clipboard
TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value
Closes #394.
The transformer accepts a dictionary that defines how numeric variables will be discretized/organized into bins. The transformer calculates and returns the average for the respective bins.
@solegalli,
I was initially thinking that this class would be similar to ArbitraryDiscretiser where the user passes the desired bins. However, I'm now realizing we may want to incorporate the EqualWidthDiscretiser and EqualFrequencyDiscretiser. What are your thoughts?
hi @solegalli,
With regards to implementation, are you envisioning the following order of operations or something of the sort?
-
var_A
is a numeric variable in dataframeX
. -
fit()
copies the original values ofvar_A
. - A discretizer, e.g. EqualFrequencyDiscretiser, discretizes and assigns a bin number to the respective values. Let's call this variable
var_A_disc
. -
MeanEncoder.fit()
acceptsvar_A_disc
asX
andvar_A
asy
.
In the case of multiple variables, the code would iterate through the numeric variables while performing the abovementioned operations.
hi @solegalli,
I'm developing the tests for TargetMeanDiscretiser(). I see that the team has made substantial edits to the unit tests. I'm using the EqualFrequencyDiscretiser and EqualWidthDiscretiser tests as guides.
Both classes have the following test:
X_t = [x for x in range(0, 10)]
assert all(x for x in X["var"].unique() if x not in X_t)
This seems to check whether the transformed variable X["var"]
possesses an integer in the range of 0 to 9. What's the test's rationale?
Gracias!
@solegalli,
I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g. @pytest.mark.parameterize
. At the same time, I see that the team has made significant edits to the tests
directory. Which unit tests should I use as a guide?
Gracias!
which line of code are you referring to?
@solegalli,
I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most-recent practices, e.g.
@pytest.mark.parameterize
. At the same time, I see that the team has made significant edits to thetests
directory. Which unit tests should I use as a guide?Gracias!
I would say, have a look at the tests for the TargetMeanRegressor and TargetMeanClassifier.
@solegalli,
Soy un boludo! I misinterpreted the instructions, which are in the "TargetMeanDiscretiser". Consequently, I created the _encode_X()
and the rest of the boludese (castellano cientifico).
Let me know about adding the ArbitraryDiscretiser(). I'll update the test pronto!
hola @solegalli,
I worked on the documentation (still need to review). I've added the class to multiple index.rst
files; however, it seems that I'm still missing the table of contents, i.e., toctree. Where is it?
Also, the new testing methodology is really cool! I'm still unclear exactly how it operates though. Maybe we can discuss it sometime?
Abrazo!
rked on the documentation (still nee
You need to add the transformer to this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/index.rst
I am off on hols since Friday, maybe we can do it on my return?
Cheers
One of the common tests is failing with this error:
E AttributeError: 'TargetMeanDiscretiser' object has no attribute 'variables_'
Gracias, @solegalli!
Enjoy your vacay! And, yes, it would be great to review the tests when you return ;)
@Morgan-Sell
Did I completely forget about this PR?
Looks like it is more or less good to go?
Ok, I went through this quickly, the docstrings still need to be added.
The idea of this transformer was to perform discretization, and then replace the values by the mean of the target
Now this functionality could be achieved already by combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline.
So I am having second thoughts as to whether we should create this class, when it is possible to obtain the same result with the transformers that we already have.
Would you mind if we keep this PR on hold?
Apologies :/
hi @solegalli,
I'll pause on editing the docstrings.
How did you envision this class to be different than combining the equal-frequency or equal-width discretizer with the MeanEncoder in a pipeline? I thought algorithmically this was the purpose.
I guess one way to think of this class as a tool in the feature-engine toolbox that a user may not be aware of until reading the feature-engine docs, assuming people read the documentation in depth ;)
Hi @Morgan-Sell
There is a paper in the KDD 2009 data science competition, where the authors discretize all numerical variables, and then replace their values by the target mean. And they use this as predictions to basically select features.
For categorical variables, they replace the categories by the target mean. And they proceed to select features in the same way.
So the MeanEncoder and the TargetMeanDiscretizer aim to do the same thing: replace the categories by the target mean and sort the variables into intervals and then replace the intevals by the target mean.
This is where my thinking came from.
Most users would not be aware of this possibility, because well, they don't know all the literature and probably don't have time to catch up. So having the class ready, is helpful, in the sense that it makes the transformation "visible".
On the other hand, it is more work for us, given that the encoding could be done already with the classes that already exist. So yeah, I am in 2 minds.
What do you think?
I made a PR with some changes here:
https://github.com/Morgan-Sell/feature_engine/pull/11
only 2 things remain doing and then we are good to go!
@Morgan-Sell this in reply to whether or not we should move forward. I am still in 2 minds. What's your view?
I'm leaning towards creating the transformer.
I see Python packages as toolboxes and like to consider the novice craftsman - e.g. me - when designing the tools. Like you said, people are going to open the toolbox and be amzed by the collection of our wonderful tools. It's unlikely for a novice to make the connection between the numerical discretizer and mean encoder.
Also, the use may be unaware of sklearn's Pipeline class.
I think we've already done most of the work.
Once created, how much maintenance is required for each class? I'm assuming it's not much but could depend on the class/transformer's complexity. I guess this class will need to be updated whenever changes are made to the numerical discretizers or mean encoder.
I'm still leaning towards creating the class for the newbies as I'm one of them ;)
Sounds good, Let's finish the IV selector, and then we come back to this one. In the meantime, we sleep on the idea a bit longer :p