feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

Add missing_only init param to all imputers. missing_only is used in combination w/ the variables param.

Open Morgan-Sell opened this issue 1 year ago • 10 comments

Closes #388.

When missing_only=True and variables=None, all numerical variables that do not contain missing values will be omitted from self.variables_.

This functionality does not apply to categorical variables.

Morgan-Sell avatar Sep 18 '23 13:09 Morgan-Sell

hi @solegalli,

When missing_only=True and variables=None, how should this impact the transformed dataset after transform() is applied? Should the transform() method only return those variables that were identified as having missing values? Or, should transform() return the complete dataset?

I'm guessing the latter because I can't think of a scenario in which we would want to eliminate variables that do not have missing values.

If my intuition is correct, is the sole purpose of this new feature for computational efficiency? If not, what are the other objectives of this new feature?

Morgan-Sell avatar Sep 22 '23 01:09 Morgan-Sell

hola @solegalli, are you on vacation? Just wanted to see if you saw my question. No rush! I'm accustomed to your rapid responses, so I thought I check-in.

Morgan-Sell avatar Sep 29 '23 19:09 Morgan-Sell

Hey @Morgan-Sell I was indeed on vacation. But the most funny thing is that I did see the question before, and I thought I answered it. LOL. Sorry about that.

transform() in feature engine always returns the complete dataset, unless the transformation adds variables, in which case it would return the complete dataframe plus the new variables.

The aim is not to expand the feature space unnecessarily. If you add indicators for all variables, not just those with NA, you will end up with a lot of variables that just contain the value 0. Is this what you mean by efficiency?

Cheers Sole

solegalli avatar Oct 10 '23 14:10 solegalli

@solegalli, welcome back! Hope you enjoyed your "holiday" as they say on your side of the pond ;)

Based on your explanation, it seems that the missing_only param solely applies to the AddMissingIndicator class. The other imputers do not add additional variables. They fill in the missing values of the existing variables with the desired value(s).

If this is the case, then should this functionality only reside in the AddMissingIndicator class?

Morgan-Sell avatar Oct 10 '23 22:10 Morgan-Sell

Its most meaningful / useful function is indeed for the MissingIndicator class. But we should add it in all imputers. It will play a role when variables =None. At the moment, when variables=None the transformers will learn imputation parameters for all numerical or categorical variables. If we add missing_only=True, they will learn parameters only for those numerical or categorical with NA. More efficient, smaller dictionaries stored, and avoids unexpected imputations when the model is live.

solegalli avatar Oct 11 '23 07:10 solegalli

@solegalli,

I added missing_value, its functionality, and appropriate tests to the ArbitraryNumberImputer and CategorialEncoder classes. Before implementing the changes in all imputers, will you please validate or nullify what I've done thus far?

Morgan-Sell avatar Oct 12 '23 20:10 Morgan-Sell

hi @solegalli,

I responded to two of your comments. Once we agree on the approach, we will quickly move forward with this PR. It's, more or less, copying/pasting the agreed-upon approach on all transformers.

Morgan-Sell avatar Oct 31 '23 21:10 Morgan-Sell

hi @solegalli, i created two global functions to identify either categorical or numerical variables that have missing values. These functions incorporate the "check_or_find_all..." functions.

I have one question about the ArbitraryImputer's fit() order of operations. See my above comment.

I also saw that DropMissingData() already incorporates the missing_value param. Should we keep it as is? Or, should we change to match the style of the other imputers, e.g., transformer imports/applies find_all_variables_with_missing_values().

In the meantime, I will implement the new code in the other imputation transformers.

Morgan-Sell avatar Nov 07 '23 14:11 Morgan-Sell

hi @solegalli,

AddMissingIndicator() already has the missing_only param. I don't recall coding it ;)

Should I refactor the code so the AddMissingIndicator() code base has a similar structure to the other imputers? Or, keep the code as is?

If you have time, take a look at the other imputers. We're almost there!!!

Morgan-Sell avatar Nov 07 '23 23:11 Morgan-Sell

hi @solegalli,

Did you see my above questions regarding the AddMissingIndicator() class?

Morgan-Sell avatar Nov 15 '23 14:11 Morgan-Sell