MLJ.jl

Allow dropping of first level in OneHotEncoder

Open CameronBieganek opened this issue 4 years ago • 3 comments

When dropping a level of a categorical variable for one-hot encoding, R and Python both default to dropping the first level. It would be nice to have that option with OneHotEncoder in MLJ. For example, I have a data set with a bunch of columns that are coded like this:

Code  Meaning
1     Yes
2     No
9     Missing

In this case I would prefer to drop the first level rather than the "missing" level. (Actually, I would really prefer to drop the "No" level...)

CameronBieganek, Jul 09 '20 17:07

How about we add a drop_first option and a drop_level option? So, e.g., drop_level=9 or drop_level="No".
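
To make the two options concrete, here is a rough, dependency-free sketch of the intended semantics on a single column. Note that `onehot_column` is a hypothetical helper written for illustration, not an existing MLJ function, and the keyword names simply follow the suggestion above; the real option names and behaviour could end up different.

```julia
# Hypothetical helper, NOT part of MLJ: one-hot encode a single column,
# optionally omitting one level from the output.
function onehot_column(column; drop_first=false, drop_level=nothing)
    levels = sort(unique(column))
    if drop_level !== nothing
        drop_level in levels || throw(ArgumentError("level $drop_level not found"))
        levels = filter(!=(drop_level), levels)   # omit the requested level
    elseif drop_first
        levels = levels[2:end]                    # omit the first level
    end
    # One Bool indicator vector per remaining level.
    return [level => (column .== level) for level in levels]
end

codes = [1, 2, 9, 1, 2]   # 1 = Yes, 2 = No, 9 = Missing

first_dropped = onehot_column(codes; drop_first=true)   # keeps levels 2 and 9
no_dropped = onehot_column(codes; drop_level=2)         # keeps levels 1 and 9
```

In this sketch drop_level takes precedence when both options are given; whether that is the right rule is an open design choice.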

ablaom, Jul 09 '20 21:07

That sounds good! Although the drop_level case is a little tricky, since you might want to be able to control the dropped level on a per-column basis. Scikit-learn allows the drop argument to be a vector, though I think a dictionary or array of pairs would probably work better than a vector. (There could be columns with continuous variables in between the categorical columns...)

CameronBieganek, Jul 09 '20 22:07

> since you might want to be able to control the dropped level on a per-column basis.

Right! 😳

I'd vote for the dictionary.
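
Roughly what I have in mind, as a self-contained sketch rather than a proposal for the actual implementation. Everything here is an assumption for illustration: `onehot_table`, the `drop` keyword, and the `name__level` output naming are hypothetical and not the current MLJ API. The dictionary is keyed by column name, so continuous columns sitting between the categorical ones are not a problem.

```julia
# Hypothetical sketch, NOT the MLJ API: `drop` maps a column name to the level
# to omit for that column; columns absent from `drop` keep every level.
function onehot_table(table::NamedTuple, categorical; drop=Dict{Symbol,Any}())
    out = Pair{Symbol,AbstractVector}[]
    for (name, column) in pairs(table)
        if name in categorical
            for level in sort(unique(column))
                # Skip the level named in `drop` for this column, if any.
                haskey(drop, name) && drop[name] == level && continue
                push!(out, Symbol(name, "__", level) => (column .== level))
            end
        else
            push!(out, name => column)   # continuous column passes through unchanged
        end
    end
    return out
end

table = (q1 = [1, 2, 9, 1], age = [34.0, 51.0, 27.0, 45.0], q2 = [2, 2, 1, 9])

# Drop "No" (2) from q1 and "Missing" (9) from q2; age is left alone.
encoded = onehot_table(table, (:q1, :q2); drop=Dict(:q1 => 2, :q2 => 9))
```

An array of pairs would work the same way here, since the lookup only needs a name-to-level mapping.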

ablaom, Jul 10 '20 05:07