category_encoders
OneHotEncoder(sparse=True)
sklearn.preprocessing.OneHotEncoder has the option sparse=True to return the output as a scipy.sparse matrix. This can be really useful if you have categories with high cardinality.
Would it be possible to add a sparse=True option to category_encoders.one_hot.OneHotEncoder?
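For context, a minimal sketch (with made-up sizes) of why sparse output matters at high cardinality: sklearn's OneHotEncoder returns a scipy.sparse matrix by default, storing one value per row instead of rows × levels cells.

```python
# Minimal sketch of why sparse one-hot output matters for high cardinality.
# The sizes here are illustrative, not from the discussion above.
import numpy as np
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.randint(0, 1000, size=(10_000, 1))  # one column, ~1000 distinct levels

X_sparse = OneHotEncoder().fit_transform(X)  # sparse output by default

assert scipy.sparse.issparse(X_sparse)
# One stored value per row; a dense float64 matrix would need
# rows * levels * 8 bytes instead.
print(X_sparse.shape, X_sparse.nnz)
```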
This is a valid feature request. Contributions are welcome.
Awesome, thanks. I'll take a look at the code, and if I can easily implement it on my own, I'll take a stab at it.
May I ask why the project wants to re-implement encoders that are already part of sklearn? I thought it was complementing sklearn in a way, by only adding encoders that are not available there yet?
I see two reasons why it may be desirable:
- For being able to quickly compare multiple encoders. By having them all in a single package, you may reasonably expect them to use the same interface everywhere, making it easy to compare one method with another.
- Support for pandas DataFrames.
I totally agree with point 1.
For point 2 I think sklearn also supports pandas DataFrames:
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> df = pd.DataFrame([("a",), ("b", )], columns=["foo"])
>>> enc.fit(df)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<class 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform(df)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
This is for sklearn version 0.21.2
Concerning the issue: what do you think about just wrapping sklearn's OneHotEncoder? That way we would have all its features available. We'd also benefit from future updates/enhancements in sklearn.
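A rough sketch of that wrapping idea (hypothetical: WrappedOneHotEncoder and its behavior are illustrative, not category_encoders' actual implementation):

```python
# Hypothetical sketch of wrapping sklearn's OneHotEncoder to return a
# pandas DataFrame; not category_encoders' actual implementation.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder

class WrappedOneHotEncoder(BaseEstimator, TransformerMixin):
    """Delegate to sklearn's OneHotEncoder, returning a DataFrame."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs  # passed through to the sklearn encoder

    def fit(self, X, y=None):
        self._enc = SkOneHotEncoder(**self.kwargs)
        self._enc.fit(X)
        return self

    def transform(self, X):
        out = self._enc.transform(X)
        if hasattr(out, "toarray"):  # densify sparse output for the DataFrame
            out = out.toarray()
        # get_feature_names_out exists in newer sklearn; fall back otherwise
        cols = (self._enc.get_feature_names_out()
                if hasattr(self._enc, "get_feature_names_out") else None)
        return pd.DataFrame(out, columns=cols, index=getattr(X, "index", None))
```

Every sklearn option (including sparse output, before densifying) would then be available for free, at the cost of keeping up with sklearn's API changes.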
category_encoders.one_hot.OneHotEncoder has 2 additional features I often use that are not in sklearn.preprocessing.OneHotEncoder:
- drop_invariant=True to drop columns with zero variance (e.g. a categorical feature that is all one level).
- handle_missing=True to encode NaNs as their own level (rather than erroring).
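As an aside, the NaN-as-its-own-level behavior can be approximated in plain pandas with pd.get_dummies(dummy_na=True), which may help illustrate what handle_missing does (a sketch, not category_encoders' code):

```python
# Sketch: one-hot encoding NaN as its own level, the behavior described
# above. pd.get_dummies supports this directly via dummy_na=True.
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", np.nan, "a"]})
encoded = pd.get_dummies(df["foo"], dummy_na=True)
print(encoded)
# Columns: 'a', 'b', and NaN; the NaN column flags rows where foo was missing.
```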
Honestly, I think it might make sense to open a PR to sklearn to port these 2 features from category_encoders. What do you all think?
Also, category_encoders has category_encoders.ordinal.OrdinalEncoder while sklearn has sklearn.preprocessing.OrdinalEncoder. In this case the category_encoders OrdinalEncoder has 3 features missing from sklearn:
- drop_invariant=True.
- handle_missing=True.
- handle_unknown=True to handle encoding for new categories.
What's interesting is that the sklearn OneHotEncoder has a handle_unknown option while the sklearn OrdinalEncoder does not.
One thing I really like about category_encoders is that every encoder (except hashing, which doesn't need it) has handle_missing and handle_unknown options. It'd be really useful to have both of these options in the sklearn encoders too.
Good to see improving support for pandas DataFrames in sklearn.
> What do you think about just wrapping around sklearn's OneHotEncoder? That way we would have all features available. We'd also benefit from future updates/enhancements in sklearn.
I am leaving the decision up to @wdm0006. It would be necessary to:
- Wrap OneHotEncoder and OrdinalEncoder.
- Get the wrappers to pass the tests, or change the unit tests and all the remaining encoders to behave more like sklearn encoders. Possible difficulties: different/missing arguments like mapping, handle_missing or handle_unknown, and handling of "ordered" Categoricals from pandas.
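On the ordered-Categoricals point: pandas already carries an explicit category order, which any wrapper would need to respect. For example:

```python
# An "ordered" pandas Categorical declares its own level order, and
# .cat.codes yields ordinal codes following that declared order.
import pandas as pd

s = pd.Series(["mid", "low", "high"]).astype(
    pd.CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
)
print(s.cat.codes.tolist())  # → [1, 0, 2]: codes follow the declared order
```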
The best possible outcome I can think of is adding the missing functionality into sklearn and porting category_encoders to use sklearn encoders.
Edited: @zachmayer was faster. But it's good to see how similar the ideas are.
Should I make a feature request on sklearn to add handle_missing and handle_unknown to their categorical encoders?
Here's a request to add handle_missing to the OneHotEncoder in sklearn: https://github.com/scikit-learn/scikit-learn/issues/11996
I also opened an issue for OrdinalEncoder: https://github.com/scikit-learn/scikit-learn/issues/17123