MLJ.jl
Allow dropping of first level in OneHotEncoder
When dropping a level of a categorical variable for one-hot encoding, R and Python both default to dropping the first level. It would be nice to have that option with `OneHotEncoder` in MLJ. For example, I have a data set with a bunch of columns that are coded like this:
| Code | Meaning |
|---|---|
| 1 | Yes |
| 2 | No |
| 9 | Missing |
In this case I would prefer to drop the first level rather than the "missing" level. (Actually, I would really prefer to drop the "No" level...)
How about we add a `drop_first` option and a `drop_level` option? So, e.g., `drop_level=9` or `drop_level="No"`?
That sounds good! The `drop_level` case is a little tricky, though, since you might want to be able to control the dropped level on a per-column basis. Scikit-learn allows the `drop` argument to be a vector, though I think a dictionary or array of pairs would probably work better than a vector. (There could be columns with continuous variables in between the categorical columns...)
> since you might want to be able to control the dropped level on a per column basis.
Right! 😳
I'd vote for the dictionary.
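To make the dictionary idea concrete, here is a minimal sketch in plain Julia of how per-column dropping might behave. It is not MLJ's implementation: the function `onehot_drop`, its `features` argument, and the `name__level` column naming are all hypothetical, chosen only for illustration. The `drop_level` and `drop_first` keywords follow the names floated above, with `drop_level` taken to be a dictionary from column name to the level to drop.

```julia
# Hypothetical helper (NOT part of MLJ): one-hot encode the columns named in
# `features` of a NamedTuple-of-vectors table, dropping one level per column.
function onehot_drop(table::NamedTuple, features::Vector{Symbol};
                     drop_level::AbstractDict = Dict{Symbol,Any}(),
                     drop_first::Bool = false)
    out = Pair{Symbol,Any}[]
    for (name, col) in pairs(table)
        if !(name in features)
            push!(out, name => col)   # e.g. a continuous column: pass through
            continue
        end
        levels = sort(unique(col))
        # A per-column entry in `drop_level` takes precedence over `drop_first`.
        dropped = if haskey(drop_level, name)
            drop_level[name]
        elseif drop_first
            first(levels)
        else
            nothing                   # keep every level
        end
        for level in levels
            isequal(level, dropped) && continue
            indicator = [isequal(x, level) ? 1.0 : 0.0 for x in col]
            push!(out, Symbol(name, "__", level) => indicator)
        end
    end
    return NamedTuple{Tuple(first.(out))}(Tuple(last.(out)))
end

# Categorical columns with a continuous column in between.
X = (answer = [1, 2, 9, 1],
     income = [10.0, 20.0, 30.0, 40.0],
     smoker = ["Yes", "No", "No", "Yes"])

# Drop level 2 ("No") from `answer` and "No" from `smoker`.
W = onehot_drop(X, [:answer, :smoker];
                drop_level = Dict(:answer => 2, :smoker => "No"))
```

Because the dictionary is keyed by column name rather than by position, the continuous `income` column sitting between the two categorical columns needs no placeholder entry, which is the problem with a positional vector.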