MLJModels.jl

Enhance treatment of missing values in one-hot encoder

Open ablaom opened this issue 2 years ago • 10 comments

There is now missing value handling in OneHotEncoder, but this simply propagates the missing values. I guess it might be nice to offer some other popular options for handling missing values, which might be complicated to handle in a post-processing step. See also the discussion here.

@Chandu-4444 @Frank-III @OlivierLabayle

ablaom avatar May 04 '22 21:05 ablaom

The current implementation comes under the all-missing case, which is the easiest and most straightforward. Other cases, like all-zero and category, can also be implemented, and I guess I can reuse part of my previous commit (link) for these. A simple modification to it, together with the current missing value handling in OneHotEncoder, can enable all the above-mentioned methods.

Any other ideas would be most welcome.

Chandu-4444 avatar May 05 '22 19:05 Chandu-4444

all-zero looks like the simplest. One question for category is how to handle missing values that appear for a feature that did not have missing values in training (fit). Here's a proposal for this:

We introduce a new hyper-parameter features_with_missing which can be: (i) a vector of feature names, (ii) the symbol :all, or (iii) the symbol :auto. When specified as a vector, the listed features will always have the extra missing category, regardless of the existence of missing values in the input for transform. If features_with_missing == :auto, then the actual list used is inferred from the training data: a feature is on the list if missing appears for that feature in the training data. If features_with_missing == :all, then every feature gets the extra missing category.
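For the :auto case, the inference could be sketched with Tables.jl along these lines (the helper name is hypothetical, not part of MLJModels.jl):

```julia
using Tables

# Infer which features should get the extra `missing` category when
# features_with_missing == :auto: a feature qualifies if `missing`
# appears in its training column. Illustrative sketch only.
function infer_features_with_missing(X)
    cols = Tables.columns(X)
    return [name for name in Tables.columnnames(cols)
            if any(ismissing, Tables.getcolumn(cols, name))]
end
```

For example, `infer_features_with_missing((a = [1, missing], b = [2, 3]))` returns `[:a]`.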

In transform, if missing appears for a feature not on the list, then an informative error is thrown, explaining that the problem can be corrected by retraining and explicitly specifying features_with_missing appropriately.

The default could be :all or :auto. Maybe :auto is okay. It might surprise a user who never reads documentation, but the error message explains what to do.

We will also need a hyper-parameter to specify the kind of missing handling: :propagate, :all_zero or :category. Name suggestion: ~~missing_handling~~ handle_missing (for consistency with sk-learn). Default: :propagate. If handle_missing is not :category, and features_with_missing is not its default value, then clean! should issue a warning that features_with_missing is being ignored. Or we could combine the two new hyper-parameters into one somehow, although I'm not sure how to do that without creating cognitive dissonance.
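As a sketch, the proposed clean! check might look like this, following the MLJ convention that clean! mutates the model and returns a warning string (the hyper-parameter names are from the proposal above, not the current MLJModels.jl API):

```julia
# Hypothetical sketch: warn and reset features_with_missing whenever it
# would be ignored because handle_missing is not :category. Assumes the
# proposed default for features_with_missing is :auto.
function clean!(model)
    message = ""
    if model.handle_missing != :category && model.features_with_missing != :auto
        message *= "features_with_missing is ignored unless " *
                   "handle_missing == :category; resetting to :auto. "
        model.features_with_missing = :auto
    end
    return message
end
```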


I wonder how this is handled elsewhere. Of course, one-hot encoding is often implemented as a "static" transformer (no separate training step), in which case this doesn't come up. This is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output each time transform is called. That is, by training just once, you can arrange that the number of spawned features does not depend on whether there are, or are not, missing values in a particular field to be transformed. Otherwise, downstream operations expecting a certain number of features might fail unexpectedly.
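To illustrate the point with the existing encoder (a sketch; the feature__level column naming follows the examples later in this thread):

```julia
using MLJBase, MLJModels, CategoricalArrays

Xtrain = (name = categorical(["a", "b", "c"]),)
mach = machine(OneHotEncoder(), Xtrain)
fit!(mach, verbosity=0)

# New data contains only "a", but shares the training pool, so the
# transform still spawns all three columns learned during fit:
Xnew = (name = categorical(["a", "a"], levels=["a", "b", "c"]),)
schema(transform(mach, Xnew)).names
```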

Anyone have a different suggestion?


Probably good to introduce the two options in separate PRs, starting with the easiest, the all-zero case.

ablaom avatar May 05 '22 20:05 ablaom

This page can help relate a few things said by @ablaom.

Chandu-4444 avatar May 06 '22 17:05 Chandu-4444

Is this how the output should be for the minimal all-zero case?

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(missing_handling = "all-zero")

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 name__missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

Chandu-4444 avatar May 07 '22 16:05 Chandu-4444

No, rather it's the same as the current behaviour, except that zeros are used instead of missings. You don't need to spawn an extra column in this case:

julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(handle_missing = :all_zero)

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

However, note that this means we cannot have drop_last = true in this case, because then we can't distinguish missing from the last class. So clean! needs to check for this: I suggest that if handle_missing == :all_zero and drop_last == true, then clean! changes drop_last to false and issues a warning.
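Sketched in the style of MLJ's clean! convention (hypothetical code; field names as proposed in this thread, not the released API):

```julia
# Sketch of the drop_last guard: with handle_missing == :all_zero an
# all-zero row must mean `missing`, so no class column can be dropped.
function clean!(model)
    message = ""
    if model.handle_missing == :all_zero && model.drop_last
        message *= "drop_last = true is incompatible with " *
                   "handle_missing == :all_zero; resetting drop_last = false. "
        model.drop_last = false
    end
    return message
end
```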

Also:

  • let's use the name handle_missing for consistency with sk-learn
  • let's use symbols for its values, not strings

ablaom avatar May 08 '22 22:05 ablaom

Thank you for adding the support for propagating missing values! I think I have identified a bug if the first value in a vector is missing:

using MLJModels, CategoricalArrays, MLJBase
X = (x=categorical([missing, 1, 2, 1]),)
t  = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)

This is due to this line. I think replacing it with classes(col) should work?
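For context (an illustrative aside): the categorical pool records levels independently of element order, and missing is never a level, which is why a pool-based accessor such as MLJBase's classes, or CategoricalArrays' levels, is robust to a leading missing:

```julia
using CategoricalArrays

x = categorical([missing, 1, 2, 1])
levels(x)  # [1, 2]; `missing` is not a level, wherever it appears
```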

olivierlabayle avatar Aug 01 '22 12:08 olivierlabayle

Yes, great catch, that's a bug: https://github.com/JuliaAI/MLJModels.jl/issues/467

Are you willing and able to make a PR with a test?

ablaom avatar Aug 01 '22 20:08 ablaom

I can give it a try if it's as easy as my suggestion, can you grant me access to the repo?

olivierlabayle avatar Aug 02 '22 13:08 olivierlabayle

Done. You have an invitation to accept.

ablaom avatar Aug 02 '22 19:08 ablaom

https://github.com/JuliaAI/MLJModels.jl/pull/468

olivierlabayle avatar Aug 03 '22 07:08 olivierlabayle