category_encoders

OneHotEncoder(sparse=True)

Open zachmayer opened this issue 5 years ago • 11 comments

sklearn.preprocessing.OneHotEncoder has the option sparse=True, to return the output in a scipy.sparse matrix. This can be really useful if you have categories with high cardinality.

Would it be possible to add a sparse=True option to category_encoders.one_hot.OneHotEncoder?
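To illustrate why sparse output matters for high-cardinality columns, here is a minimal sketch that builds a one-hot matrix directly with scipy.sparse and compares its footprint with the equivalent dense array. The row and category counts are made-up values for illustration, and this is not how either library implements its encoder.

```python
import numpy as np
from scipy import sparse

# Illustrative sizes (assumptions, not taken from the issue).
n_rows, n_categories = 10_000, 1_000

# Pretend each row holds one category, already mapped to an integer code.
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)

# One-hot encoding stores exactly one non-zero value per row.
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_categories),
)

# CSR keeps three arrays: the values, their column indices, and row pointers.
sparse_bytes = (
    one_hot.data.nbytes + one_hot.indices.nbytes + one_hot.indptr.nbytes
)
# A dense array of the same shape stores every zero explicitly.
dense_bytes = n_rows * n_categories * one_hot.dtype.itemsize

print(f"sparse: {sparse_bytes:,} bytes, dense: {dense_bytes:,} bytes")
```

The sparse footprint grows with the number of rows, while the dense footprint grows with rows times categories.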

zachmayer avatar Jan 07 '20 16:01 zachmayer

This is a valid feature request. Commits are welcomed.

janmotl avatar Jan 07 '20 19:01 janmotl

Awesome, thanks. I'll take a look at the code, and if I can easily implement it on my own, I'll take a stab at it.

zachmayer avatar Jan 07 '20 19:01 zachmayer

May I ask why the project wants to re-implement encoders that are already part of sklearn? I thought it was complementing sklearn in a way, by only adding encoders that are not available there yet.

PaulWestenthanner avatar May 02 '20 18:05 PaulWestenthanner

I see two reasons why it may be desirable:

  1. For being able to quickly compare multiple encoders. By having them all in a single package, you may reasonably expect them to use the same interface everywhere, making it easy to compare one method with another.
  2. Support for pandas DataFrames.

janmotl avatar May 03 '20 08:05 janmotl

I totally agree with point 1.
For point 2 I think sklearn also supports pandas DataFrames:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> df = pd.DataFrame([("a",), ("b", )], columns=["foo"])
>>> enc.fit(df)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform(df)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in Compressed Sparse Row format>

This is for sklearn version 0.21.2.

Concerning the issue: what do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available. We'd also benefit from future updates/enhancements in sklearn.

PaulWestenthanner avatar May 03 '20 12:05 PaulWestenthanner

category_encoders.one_hot.OneHotEncoder has 2 additional features I often use that are not in sklearn.preprocessing.OneHotEncoder:

  1. drop_invariant=True to drop columns with zero variance (e.g. a categorical feature that is all one level).
  2. handle_missing=True to encode NaNs as their own level (rather than erroring).
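The two behaviours listed above can be approximated in plain pandas. This is only a sketch of the idea, not the category_encoders implementation: the helper name `one_hot_with_missing` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def one_hot_with_missing(df, cols, drop_invariant=True):
    """Hypothetical helper: one-hot encode `cols`, giving NaN its own level."""
    # dummy_na=True adds a "<col>_nan" indicator column for missing values.
    encoded = pd.get_dummies(df, columns=cols, dummy_na=True, dtype=int)
    if drop_invariant:
        # A column with a single distinct value has zero variance.
        invariant = [c for c in encoded.columns if encoded[c].nunique() <= 1]
        encoded = encoded.drop(columns=invariant)
    return encoded


df = pd.DataFrame(
    {
        "color": ["red", "blue", np.nan, "red"],  # contains a missing value
        "const": ["x", "x", "x", "x"],            # all one level
    }
)
result = one_hot_with_missing(df, cols=["color", "const"])
print(result)
```

Here the `const` column yields only constant indicator columns, so they are dropped, while the missing `color` value is kept as its own `color_nan` level.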

Honestly, I think it might make sense to open a PR to sklearn to port these 2 features from category_encoders.

What do you all think?

zachmayer avatar May 03 '20 12:05 zachmayer

Also, category_encoders has category_encoders.ordinal.OrdinalEncoder while sklearn has sklearn.preprocessing.OrdinalEncoder. In this case, the category_encoders OrdinalEncoder has 3 features missing from the sklearn one:

  1. drop_invariant=True.
  2. handle_missing=True.
  3. handle_unknown=True to handle encoding for new categories.

What's interesting is that the sklearn OneHotEncoder has a handle_unknown option while the sklearn OrdinalEncoder does not.

One thing I really like about category_encoders is that every encoder (except hashing, which doesn't need it) has a handle_missing and a handle_unknown option. It'd be really useful to have both of these options in the sklearn encoders too.
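For reference, this is roughly what the handle_unknown option looks like on the sklearn OneHotEncoder. It is a small sketch with made-up data: with handle_unknown="ignore", a category not seen during fit is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories.
train = np.array([["a"], ["b"]], dtype=object)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# "c" was never seen during fit; with "ignore" it maps to all zeros.
test = np.array([["a"], ["c"]], dtype=object)
encoded = enc.transform(test).toarray()
print(encoded)
```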

zachmayer avatar May 03 '20 12:05 zachmayer

Good to see improving support for pandas DataFrames in sklearn.

> What do you think about just wrapping around sklearn's OneHotEncoder. That way we would have all features available. We'd also be benefiting from future updates/enhancements in sklearn.

I am leaving the decision up to @wdm0006. It would be necessary to:

  1. Wrap OneHotEncoder and OrdinalEncoder.
  2. Get the wrappers to pass the tests. Or change the unit tests and all the remaining encoders to behave more like sklearn encoders. Possible difficulties: different/missing arguments like mapping, handle_missing or handle_unknown. And handling of "ordered" Categoricals from pandas.
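A rough sketch of what such a wrapper might look like follows. The class name `WrappedOneHotEncoder`, its arguments, and the column-naming scheme are all hypothetical; it does not cover mapping, handle_missing, or ordered Categoricals, which are exactly the difficulties listed above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


class WrappedOneHotEncoder:
    """Hypothetical thin wrapper around sklearn's OneHotEncoder."""

    def __init__(self, cols, sparse=False, handle_unknown="ignore"):
        self.cols = list(cols)
        self.sparse = sparse
        # sklearn's encoder returns a scipy sparse matrix by default.
        self._enc = OneHotEncoder(handle_unknown=handle_unknown)

    def fit(self, X, y=None):
        self._enc.fit(X[self.cols])
        # Build "<column>_<category>" names from the learned categories.
        self.feature_names_ = [
            f"{col}_{cat}"
            for col, cats in zip(self.cols, self._enc.categories_)
            for cat in cats
        ]
        return self

    def transform(self, X):
        encoded = self._enc.transform(X[self.cols])
        if self.sparse:
            # Sparse mode returns only the encoded columns.
            return encoded
        dense = pd.DataFrame(
            encoded.toarray(), columns=self.feature_names_, index=X.index
        )
        # Dense mode keeps the untouched columns alongside the new ones.
        return pd.concat([X.drop(columns=self.cols), dense], axis=1)


df = pd.DataFrame({"foo": ["a", "b", "a"], "n": [1, 2, 3]})
dense_out = WrappedOneHotEncoder(cols=["foo"]).fit(df).transform(df)
sparse_out = WrappedOneHotEncoder(cols=["foo"], sparse=True).fit(df).transform(df)
print(dense_out)
```

Delegating to sklearn keeps the sparse path nearly free, but every category_encoders-specific argument would still need its own handling on top.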

The best possible outcome I can think of is adding the missing functionality into sklearn and porting category_encoders to use sklearn encoders.

Edited: @zachmayer was faster. But good to see the similarity of the ideas.

janmotl avatar May 03 '20 12:05 janmotl

Should I make a feature request on sklearn to add handle_missing and handle_unknown to their cat encoders?

zachmayer avatar May 04 '20 14:05 zachmayer

Here's a request to add handle_missing to the OneHotEncoder in sklearn: https://github.com/scikit-learn/scikit-learn/issues/11996

zachmayer avatar May 04 '20 14:05 zachmayer

I also opened an issue for OrdinalEncoder: https://github.com/scikit-learn/scikit-learn/issues/17123

zachmayer avatar May 04 '20 14:05 zachmayer