Enable the OneHotEncoder to be able to drop categories

Open 27pchrisl opened this issue 1 year ago • 3 comments

Hi,

I've been working with a sparse dataset in which the '?' category should really be encoded with none of the generated features hot when using the OneHotEncoder.

This contribution adds this as a backwards-compatible option to the encoder.
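For readers who want to see the intended behavior concretely, here is a minimal sketch in plain Python. It is not the library's actual API: the function name `one_hot_with_drop` and the `drop` parameter are hypothetical, chosen only to illustrate a category that encodes to an all-zero row.

```python
def one_hot_with_drop(values, drop="?"):
    """One-hot encode `values`, leaving the `drop` category as an all-zero row.

    Hypothetical helper for illustration only.
    """
    # The dropped category gets no output column of its own.
    categories = sorted({v for v in values if v != drop})
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:  # the dropped category is not in `index`, so the row stays all zeros
            row[index[v]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot_with_drop(["red", "?", "blue", "red"])
print(categories)  # ['blue', 'red']
print(rows)        # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

With `drop=None` and no `None` values in the data, every category gets its own column, which is the sense in which the option is backwards-compatible in this sketch.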

27pchrisl avatar May 27 '24 11:05 27pchrisl

Thanks Andrew!

27pchrisl avatar Jul 02 '24 13:07 27pchrisl

Hey @27pchrisl, I'm interested to know whether you've thought of other approaches, for example filtering specific categories from the dataset before one-hot encoding it. Would a "CategoryDropper" Transformer allow for the same outcome when paired with OneHotEncoder, while also serving other useful purposes? I get that you'd have to replace the category with something (perhaps a missing data placeholder, e.g. '?'), so it's not really "dropping" the category, but maybe this could be handled by making OneHotEncoder "missing data aware" so that it ignores those values.

I think if we can rule out there being a better alternative to handling the "dropping" of categories in the OneHotEncoder, then this is a go.

Also, I'm just a tiny bit concerned about there being no discrimination between feature columns here, for instance if the same set of categories were used to describe different features. You wouldn't have control over which columns to operate on; it would always be all of them. This is not a deal-breaker for me though - just something we would want to make special note of in the documentation.
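One way to picture the per-column concern is the sketch below, written in plain Python rather than against the library's API. The functions `fit_categories` and `transform` and the `drop_by_column` mapping (column offset to dropped category) are hypothetical names; the point is only that '?' is dropped in column 0 but kept as an ordinary category in column 1.

```python
def fit_categories(samples, drop_by_column):
    """Collect the sorted categories per column, omitting each column's dropped category."""
    fitted = []
    for j in range(len(samples[0])):
        seen = {row[j] for row in samples}
        if j in drop_by_column:
            seen.discard(drop_by_column[j])
        fitted.append(sorted(seen))
    return fitted


def transform(samples, fitted):
    """One-hot encode every column; a dropped category yields all zeros for that column."""
    encoded = []
    for row in samples:
        out = []
        for j, categories in enumerate(fitted):
            out.extend(1 if row[j] == c else 0 for c in categories)
        encoded.append(out)
    return encoded


samples = [["?", "yes"], ["gold", "?"], ["silver", "no"]]
fitted = fit_categories(samples, drop_by_column={0: "?"})  # drop '?' in column 0 only
encoded = transform(samples, fitted)
print(fitted)   # [['gold', 'silver'], ['?', 'no', 'yes']]
print(encoded)  # [[0, 0, 0, 0, 1], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0]]
```

An encoder with a single global drop value corresponds to passing the same category for every column offset.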

andrewdalpino avatar Jul 09 '24 01:07 andrewdalpino

Hi @andrewdalpino, yep, I agree that if you have a feature where many categories should not be hot, the author should transform that outside of the OneHotEncoder so it can just do its own job, similar to preparing the data with the MissingDataImputer. Then the OHE only needs to be told which single category should be dropped, probably defaulting to '?'.
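The two-step arrangement described here could look roughly like the following plain-Python sketch. Both functions are hypothetical stand-ins, not library classes: `replace_categories` plays the role of the preprocessing step that collapses several unwanted categories into one placeholder, and `one_hot` is an encoder that drops a single category, defaulting to '?'.

```python
def replace_categories(values, unwanted, placeholder="?"):
    """Preprocessing step: map every unwanted category to a single placeholder."""
    unwanted = set(unwanted)
    return [placeholder if v in unwanted else v for v in values]


def one_hot(values, drop="?"):
    """Encoder step: one column per category except `drop`, which encodes to all zeros."""
    categories = sorted(set(values) - {drop})
    return [[1 if v == c else 0 for c in categories] for v in values]


raw = ["email", "unknown", "phone", "n/a", "email"]
cleaned = replace_categories(raw, unwanted={"unknown", "n/a"})
encoded = one_hot(cleaned)
print(cleaned)  # ['email', '?', 'phone', '?', 'email']
print(encoded)  # [[1, 0], [0, 0], [0, 1], [0, 0], [1, 0]]
```

Keeping the replacement outside the encoder means the encoder only ever needs one drop value, however many raw categories are treated as missing.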

I took inspiration for the signature from scikit-learn, which probably isn't the best source since Python libraries tend to really overload their parameters ☺️

I'm using a very sparse dataset (CRM data), so I definitely need the capability for a none-hot category to prevent the model thinking the absence of a category is a category in itself. Absence represents poor quality data rather than a deliberate choice. My goal was to have the model ignore the feature in that case.

27pchrisl avatar Jul 09 '24 09:07 27pchrisl