MLJModels.jl
Enhance treatment of missing values in one-hot encoder
There is now missing value handling in `OneHotEncoder`, but this simply propagates the missing values. It might be nice to offer some other popular options for handling missing values, which can be complicated to handle in a post-processing step. See also the discussion here.
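For concreteness, here is a minimal sketch of the current propagating behaviour (the commented output is indicative, assuming the default `drop_last=false`; exact details may differ):

```julia
using MLJModels, MLJBase, CategoricalArrays

X = (name = categorical(["a", "b", missing]),)
t = OneHotEncoder()
f, _, _ = MLJBase.fit(t, 1, X)
MLJBase.transform(t, f, X)
# missing values propagate into every spawned column:
# (name__a = [1.0, 0.0, missing],
#  name__b = [0.0, 1.0, missing])
```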
@Chandu-4444 @Frank-III @OlivierLabayle
The current implementation comes under the `all-missing` case. This is the easiest and most straightforward case. Other cases, like `all-zero` and `category`, can also be implemented, and I can reuse part of my previous commit (link) to incorporate these. A simple modification to that commit, combined with the current implementation for handling missing values in `OneHotEncoder`, can enable all the above-mentioned methods.
Any other ideas would be most welcome.
`all-zero` looks like the simplest. One question for `category` is how to handle `missing` values that appear for a feature that did not have `missing` values in training (`fit`). Here's a proposal for this:
We introduce a new hyper-parameter `features_with_missing`, which can be: (i) a vector of feature names; (ii) the symbol `:all`; (iii) the symbol `:auto`. When specified as a vector, the listed features always get the extra `missing` category, regardless of whether `missing` values appear in the input to `transform`. If `features_with_missing == :auto`, then the actual list used is inferred from the training data: a feature is on the list if `missing` appears for that feature in the training data. If `features_with_missing == :all`, then every feature gets the extra `missing` category.

In `transform`, if `missing` appears for a feature not on the list, then an informative error is thrown, explaining that the problem can be corrected by retraining and explicitly specifying `features_with_missing` appropriately.

The default could be `:all` or `:auto`. Maybe `:auto` is okay. It might lead to a surprise for the user that never reads documentation, but the error message explains what to do.
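For illustration, the proposed hyper-parameter might be used like this (a hypothetical API following the proposal above; none of these keywords exist yet):

```julia
using MLJModels

# Proposed, hypothetical keyword; not part of the current OneHotEncoder API:
enc = OneHotEncoder(features_with_missing = [:name, :city])  # explicit list of features
enc = OneHotEncoder(features_with_missing = :all)   # every feature gets the extra category
enc = OneHotEncoder(features_with_missing = :auto)  # list inferred from training data
```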
We will also need a hyper-parameter to specify the kind of missing handling: `:propagate`, `:all_zero` or `:category`. Name suggestion: ~~`missing_handling`~~ `handle_missing` (for consistency with sk-learn). Default: `:propagate`. If `handle_missing` is not `:category`, and `features_with_missing` is not its default value, then `clean!` should issue a warning that `features_with_missing` is being ignored. Or we could combine the two new hyper-parameters into one somehow, although I'm not sure how to do this without creating cognitive dissonance.
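Here is a minimal sketch of that `clean!` check, following MLJ's convention that `clean!` mutates the model and returns a warning string. The toy struct, field names, and the `:auto` default are assumptions taken from the proposal above, not the actual MLJModels code:

```julia
# Standalone toy model; the real OneHotEncoder has more fields.
mutable struct ToyOneHotEncoder
    handle_missing::Symbol                               # :propagate, :all_zero or :category
    features_with_missing::Union{Symbol,Vector{Symbol}}  # :auto, :all, or feature names
end

function clean!(model::ToyOneHotEncoder)
    message = ""
    if model.handle_missing != :category && model.features_with_missing != :auto
        message *= "`features_with_missing` is ignored unless " *
                   "`handle_missing == :category`. "
    end
    return message
end

clean!(ToyOneHotEncoder(:all_zero, [:name]))  # returns the warning string
```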
I wonder how this is handled elsewhere. Of course, one-hot encoding is sometimes implemented as a "static" transformer (no separate training step), in which case this question doesn't come up. This is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output each time `transform` is called. That is, by training just once, you can arrange that the number of spawned features does not depend on whether there are, or are not, missing values in a particular field to be transformed. Otherwise, downstream operations expecting a certain number of features might fail unexpectedly.
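To illustrate the point, with the current trained design the spawned features are fixed at `fit` time, so new data realizing fewer levels still yields the same columns (output shown as an indicative comment):

```julia
using MLJModels, MLJBase, CategoricalArrays

Xtrain = (name = categorical(["a", "b", "c"]),)
t = OneHotEncoder()
f, _, _ = MLJBase.fit(t, 1, Xtrain)

# New data from the same pool, but with only two levels realized:
Xnew = (name = categorical(["a", "b"], levels = ["a", "b", "c"]),)
MLJBase.transform(t, f, Xnew)
# (name__a = [1.0, 0.0], name__b = [0.0, 1.0], name__c = [0.0, 0.0])
```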
Anyone have a different suggestion?
Probably good to introduce the two options in separate PRs, starting with the easiest `all-zero` case.
This page can help relate a few things said by @ablaom.
Is this how the output should look for the minimal `all-zero` case?

```julia
julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(missing_handling = "all-zero")

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 name__missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```
No, rather it's the same as the current behaviour, except that instead of `missing`s you use zeros. You don't need to spawn an extra column in this case:

```julia
julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(handle_missing = :all_zero)

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```
However, note that this means we cannot have `drop_last=true` in this case, because then we can't distinguish `missing` from the last class. So `clean!` needs to check this: I suggest that if `handle_missing == :all_zero` and `drop_last` is `true`, then `clean!` changes `drop_last` to `false`, issuing a warning in that case.
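A sketch of that check, again using a standalone toy struct (field names are assumed, not the real model's):

```julia
# Toy stand-in for OneHotEncoder with the proposed hyper-parameter:
mutable struct ToyEncoder
    handle_missing::Symbol
    drop_last::Bool
end

function clean!(model::ToyEncoder)
    message = ""
    if model.handle_missing == :all_zero && model.drop_last
        message *= "`drop_last=true` cannot be combined with " *
                   "`handle_missing=:all_zero`; resetting `drop_last=false`. "
        model.drop_last = false  # mutate the model back to a valid state
    end
    return message
end

enc = ToyEncoder(:all_zero, true)
clean!(enc)    # returns the warning message
enc.drop_last  # false
```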
Also:

- let's use the name `handle_missing`, for consistency with sk-learn
- let's use symbols for its values, not strings
Thank you for adding the support for propagating missing values! I think I have identified a bug when the first value in a vector is missing:

```julia
using MLJModels, CategoricalArrays, MLJBase

X = (x = categorical([missing, 1, 2, 1]),)
t = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)
```

This is due to this line. I think replacing it by `classes(col)` should work?
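To illustrate the suggestion (assuming, as proposed above, that `classes` accepts the column itself):

```julia
using MLJBase, CategoricalArrays

col = categorical([missing, 1, 2, 1])
col[1]        # missing, so any code keying off the first element misbehaves
classes(col)  # the full pool [1, 2], even though the first entry is missing
```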
Yes, great catch, that's a bug: https://github.com/JuliaAI/MLJModels.jl/issues/467

Are you willing and able to make a PR with a test?
I can give it a try if it's as easy as my suggestion. Can you grant me access to the repo?
Done. You have an invitation to accept.
https://github.com/JuliaAI/MLJModels.jl/pull/468