feature_engine Create check_y_is_binary() check. Use in transformers in which the dependent variable must be binary.

Closes #413

Notes from #413:

Many transformers in feature engine require that y is binary. At the moment we do this check within each transformer. We should create a function that unifies this behaviour and call it from the transformers instead of repeating all the logic every time.

Example: encoders woe, pratio

Jul 09 '22 22:07 Morgan-Sell

hola @solegalli,

Two questions:

When we say "binary", are we restricting the values to 0s and 1s? Or, can y be a different set of binary variables, e.g. "Y" and "N"?
Is it ok if the y variable contains one of the binary variables? Assuming that y must be either 0 or 1, is it ok for all the values to be 0? The dataset may not be the most informative; however, having a dataset comprised of solely one class is possible.

Jul 09 '22 23:07 Morgan-Sell

Hi @Morgan-Sell

Thanks for the PR. I am thinking out loud below:

This function would be used by:

MeanEncoder
OrdinalEncoder
PRatioEncoder
WoE

For categorical targets:

The MeanEncoder only makes sense when the target is 0 and 1
The OrdinalEncoder is more versatile, because it just needs the mean values to order the categories, so could be -1 and 1, but definitely not strings.
PRatioEncoder will be deprecated. Is a rubbish class
WoE only makes sense when the target is 0 and 1, the logic is in there already

So in short, we would need this function only for WoE and for MeanEncoder. We could use it for the OrdinalEncoder as well. And the logic of the function already exists in the WoE, here:

https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/encoding/woe.py#L147-L152

and here:

https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/encoding/woe.py#L160-L163

So, we would have to encapsulate that in the new function which you already created.

The function should:

check that y is binary, if not, raise an error
if the values of y are not 0 and 1, create a new y and remap the classes to 0 and 1 to compute the encoding.

That should answer question 1.

Regarding question 2, it does not make sense to have only 1 value. But I am not sure we should take care of that. The user should know better, lol. I think I would leave that for now. Is that alright?

Aug 03 '22 06:08 solegalli

More on this: we can use the type_of_target function from sklearn to check if the target is binary as we did here: https://github.com/feature-engine/feature_engine/blob/65fedf100bc85f364b26838866a264dbf7320841/feature_engine/discretisation/decision_tree.py#L166

We could add a parameter to this function to corroborate if the values of y are 0 and 1. It defaults to False.

We would need this function also from the IV selector you are working on in #488

Aug 17 '22 13:08 solegalli

Yes, the return of the Jedi (DS) Master, @solegalli! ;)

Thanks for your feedback! I incorporated the changes and created the appropriate tests.

I agree that the user should know better. I just wanted to make sure.

Should I incorporate _check_y_is_binary() into MeanEncoder and OrdinalEncoder? Or, should that be a separate PR?

Also, I'm unclear as to what is causing the error associated with test_boxcox_transformer.py. I didn't touch the file. Maybe sometime when you're free we can review some of the logic behind the global test_estimator_checks, e.g. test_estimator_check_wrappers.py?

Aug 17 '22 23:08 Morgan-Sell

Hi @Morgan-Sell

It's me again :p

I did some performance comparison, and type_of_target is slower than y.unique(). We have the unique() implementation is WoE, so we should stick with that one.

Also, at the moment, the only transformer checking for binary classification is the WoE.

The OrdinalEncoder and the MeanEncoder don't because they are also suitable for regression.

So I think, we should not move ahead with this change, since it is only used in 1 transformer.

Going back to the OrdinalEncoder and the MeanEncoder, the problem starts when the target is multiclass, because the mean of 3 classes does not make sense because their numbers are not ordered in general. The thing is, there is no way of effectively detecting that a target is multiclass.

So for the time being, I think we need to let that one go.

So, I'd suggest you go ahead and close this PR. And then close the related issue. Would that be alright?

Sorry for the back and forth.

Aug 18 '22 08:08 solegalli

feature_engine feature_engine copied to clipboard

Create check_y_is_binary() check. Use in transformers in which the dependent variable must be binary.

feature_engine
feature_engine copied to clipboard