feature_engine
feature_engine copied to clipboard
Create check_y_is_binary() check. Use in transformers in which the dependent variable must be binary.
Closes #413
Notes from #413:
Many transformers in feature engine require that y is binary. At the moment we do this check within each transformer. We should create a function that unifies this behaviour and call it from the transformers instead of repeating all the logic every time.
Example: encoders woe, pratio
hola @solegalli,
Two questions:
-
When we say "binary", are we restricting the values to 0s and 1s? Or, can
y
be a different set of binary variables, e.g. "Y" and "N"? -
Is it ok if the
y
variable contains one of the binary variables? Assuming thaty
must be either 0 or 1, is it ok for all the values to be 0? The dataset may not be the most informative; however, having a dataset comprised of solely one class is possible.
Hi @Morgan-Sell
Thanks for the PR. I am thinking out loud below:
This function would be used by:
- MeanEncoder
- OrdinalEncoder
- PRatioEncoder
- WoE
For categorical targets:
- The MeanEncoder only makes sense when the target is 0 and 1
- The OrdinalEncoder is more versatile, because it just needs the mean values to order the categories, so could be -1 and 1, but definitely not strings.
- PRatioEncoder will be deprecated. Is a rubbish class
- WoE only makes sense when the target is 0 and 1, the logic is in there already
So in short, we would need this function only for WoE and for MeanEncoder. We could use it for the OrdinalEncoder as well. And the logic of the function already exists in the WoE, here:
https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/encoding/woe.py#L147-L152
and here:
https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/encoding/woe.py#L160-L163
So, we would have to encapsulate that in the new function which you already created.
The function should:
- check that y is binary, if not, raise an error
- if the values of y are not 0 and 1, create a new y and remap the classes to 0 and 1 to compute the encoding.
That should answer question 1.
Regarding question 2, it does not make sense to have only 1 value. But I am not sure we should take care of that. The user should know better, lol. I think I would leave that for now. Is that alright?
More on this: we can use the type_of_target function from sklearn to check if the target is binary as we did here: https://github.com/feature-engine/feature_engine/blob/65fedf100bc85f364b26838866a264dbf7320841/feature_engine/discretisation/decision_tree.py#L166
We could add a parameter to this function to corroborate if the values of y are 0 and 1. It defaults to False.
We would need this function also from the IV selector you are working on in #488
Yes, the return of the Jedi (DS) Master, @solegalli! ;)
Thanks for your feedback! I incorporated the changes and created the appropriate tests.
I agree that the user should know better. I just wanted to make sure.
Should I incorporate _check_y_is_binary()
into MeanEncoder and OrdinalEncoder? Or, should that be a separate PR?
Also, I'm unclear as to what is causing the error associated with test_boxcox_transformer.py. I didn't touch the file. Maybe sometime when you're free we can review some of the logic behind the global test_estimator_checks, e.g. test_estimator_check_wrappers.py?
Hi @Morgan-Sell
It's me again :p
I did some performance comparison, and type_of_target
is slower than y.unique()
. We have the unique()
implementation is WoE, so we should stick with that one.
Also, at the moment, the only transformer checking for binary classification is the WoE.
The OrdinalEncoder and the MeanEncoder don't because they are also suitable for regression.
So I think, we should not move ahead with this change, since it is only used in 1 transformer.
Going back to the OrdinalEncoder and the MeanEncoder, the problem starts when the target is multiclass, because the mean of 3 classes does not make sense because their numbers are not ordered in general. The thing is, there is no way of effectively detecting that a target is multiclass.
So for the time being, I think we need to let that one go.
So, I'd suggest you go ahead and close this PR. And then close the related issue. Would that be alright?
Sorry for the back and forth.