ML Enable the OneHotEncoder to be able to drop categories

Hi,

I've been working with a sparse dataset, in which my '?' category should really be represented as none of the generated features being hot when using the OneHotEncoder.

This contribution adds this as a backwards-compatible option to the encoder.
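
A minimal sketch of the intended behaviour, written in plain Python for illustration rather than against the library's own code. The function name `one_hot_encode`, its `drop` parameter, and the sample values are assumptions made for this example, not the signature added by the contribution.

```python
def one_hot_encode(samples, drop=None):
    """One-hot encode rows of categorical values, skipping any dropped categories."""
    drop = set(drop or [])
    n_columns = len(samples[0])

    # Retained categories per column, sorted so the output order is deterministic.
    categories = [
        sorted({row[j] for row in samples} - drop)
        for j in range(n_columns)
    ]

    encoded = []
    for row in samples:
        vector = []
        for j, value in enumerate(row):
            # A dropped value matches no retained category, so every slot stays 0.
            vector.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(vector)

    return categories, encoded


samples = [["red"], ["?"], ["blue"]]

categories, encoded = one_hot_encode(samples, drop=["?"])
# categories == [["blue", "red"]]
# encoded == [[0, 1], [0, 0], [1, 0]]   (the '?' row has no hot feature)

_, encoded_default = one_hot_encode(samples)
# Without `drop`, '?' gets its own column: [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Leaving `drop` unset reproduces the ordinary encoding, which is what makes the option backwards-compatible in this sketch.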

May 27 '24 11:05 27pchrisl

Thanks Andrew!

Jul 02 '24 13:07 27pchrisl

Hey @27pchrisl, I'm interested to know if you've thought of other approaches, for example filtering specific categories from the dataset before one-hot encoding it. Would a "CategoryDropper" Transformer allow for the same outcome when paired with OneHotEncoder but also serve other useful purposes? I get that you'd have to replace the category with something (perhaps a missing data placeholder, e.g. '?'), so it's not really "dropping" the category, but maybe this could be handled by making OneHotEncoder "missing data aware" so that it ignores those values.

I think if we can rule out there being any better alternative to handling the "dropping" of categories in the OneHotEncoder, then this is a go.

Also, I'm just a tiny bit concerned about there being no discrimination between feature columns here, for instance if the same set of categories were used to describe different features. You wouldn't have control over which columns to operate on; it would always be all of them. This is not a deal-breaker for me though, just something we would want to make special note of in the documentation.
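
To make the column concern concrete, here is a small plain-Python sketch. The data and the per-column `drop_by_column` mapping are hypothetical and are not part of this contribution; they only contrast a single drop list with per-column control.

```python
# Hypothetical data: "none" is a genuine answer in column 0
# but a missing-value marker in column 1.
samples = [
    ["none", "email"],
    ["gold", "none"],
    ["silver", "phone"],
]


def retained_categories(samples, drop_for_column):
    """Return the sorted categories each column keeps after dropping."""
    return [
        sorted({row[j] for row in samples} - drop_for_column(j))
        for j in range(len(samples[0]))
    ]


# One drop list applied to every column: "none" disappears from both.
global_drop = {"none"}
retained_global = retained_categories(samples, lambda j: global_drop)
# retained_global == [["gold", "silver"], ["email", "phone"]]

# Hypothetical per-column control: only column 1 drops "none".
drop_by_column = {1: {"none"}}
retained_per_column = retained_categories(
    samples, lambda j: drop_by_column.get(j, set())
)
# retained_per_column == [["gold", "none", "silver"], ["email", "phone"]]
```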

Jul 09 '24 01:07 andrewdalpino

Hi @andrewdalpino, yep, I agree that if you have a feature where many categories should not be hot, the author should transform that outside of the OneHotEncoder so it can just do its own job, similar to preparing the data with the MissingDataImputer. Then the OHE only needs to be told which single category should be dropped, probably defaulting to '?'.
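
A rough sketch of that preprocessing step in plain Python. The helper name `collapse_to_placeholder` and the sample values are made up for illustration and do not correspond to an existing transformer.

```python
def collapse_to_placeholder(samples, not_hot, placeholder="?"):
    """Replace every category that should not be hot with a single placeholder."""
    return [
        [placeholder if value in not_hot else value for value in row]
        for row in samples
    ]


samples = [["unknown"], ["n/a"], ["retail"]]

prepared = collapse_to_placeholder(samples, not_hot={"unknown", "n/a"})
# prepared == [["?"], ["?"], ["retail"]]
```

After this step the encoder only has to know about one category to drop, the placeholder.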

I took inspiration for the signature from scikit-learn, which probably isn't the best source, since Python libraries tend to really overload their parameters ☺️

I'm using a very sparse dataset (CRM data), so I definitely need the capability for a none-hot category to prevent the model from thinking the absence of a category is a category in itself. Absence represents poor-quality data rather than a deliberate choice. My goal was to have the model ignore the feature in that case.

Jul 09 '24 09:07 27pchrisl
