category_encoders
Calling .fit_transform in CatBoostEncoder doesn't work
I'm putting CatBoostEncoder in a pipeline just before a RandomForestClassifier, but the RF raises an error because every value it receives is NaN. If I manually call .fit(X, y) and then .transform(X), everything works. But if I call .fit_transform(X, y), the output is all NaNs.
Expected Behavior
A successfully trained RF classifier.
Actual Behavior
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Steps to Reproduce the Problem
Fully reproducible example:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
X = ['a', 'a', 'b', 'c']
y = [1, 1, 1, 0]
cb = ce.CatBoostEncoder()
rf = RandomForestClassifier()
print('Using .fit() and then .transform()')
cb.fit(X, y)
X_tr = cb.transform(X)
rf.fit(X_tr, y)
print('Success!\n')
print('Using .fit_transform()')
rf_pipeline = Pipeline(steps=[
    ('preprocesser', cb),
    ('classifier', rf)
])
rf_pipeline.fit(X, y)
Output:
Using .fit() and then .transform()
Success!
Using .fit_transform()
Traceback (most recent call last):
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Specifications
- Version: 2.2.2
- Platform: Windows-10-10.0.18362-SP0
- Subsystem: Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
For anyone else, quick fix in your own code:
import category_encoders as ce

class CatBoostEnc(ce.CatBoostEncoder):
    def fit_transform(self, X, y):
        # Bypass the inherited fit_transform: fit first, then do a plain transform.
        return super().fit(X, y).transform(X)

cbe = CatBoostEnc()
There you have your encoder with a working .fit_transform().
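For what it's worth, here is a rough sketch of why the override matters inside a Pipeline: for every step except the last, scikit-learn calls .fit_transform() when the step has one, and only falls back to .fit().transform() otherwise. That would explain why the manual calls succeed while the pipeline fails. Everything below is a hypothetical stand-in, not the library's code: ToyEncoder is a made-up mean-target encoder whose fit_transform simulates the NaN symptom, and run_intermediate_step is a simplified model of the pipeline's dispatch.

```python
import math


class ToyEncoder:
    """Hypothetical stand-in for an encoder whose fit_transform is broken."""

    def fit(self, X, y):
        # Learn the mean target per category (not the real CatBoost scheme).
        sums, counts = {}, {}
        for cat, target in zip(X, y):
            sums[cat] = sums.get(cat, 0.0) + target
            counts[cat] = counts.get(cat, 0) + 1
        self.means_ = {cat: sums[cat] / counts[cat] for cat in sums}
        return self

    def transform(self, X):
        return [self.means_[cat] for cat in X]

    def fit_transform(self, X, y):
        # Simulates the reported symptom: every value comes back as NaN.
        self.fit(X, y)
        return [math.nan for _ in X]


class PatchedToyEncoder(ToyEncoder):
    def fit_transform(self, X, y):
        # Same shape as the workaround: fit, then a plain transform.
        return super().fit(X, y).transform(X)


def run_intermediate_step(transformer, X, y):
    """Simplified model of how a pipeline fits a non-final step."""
    if hasattr(transformer, "fit_transform"):
        return transformer.fit_transform(X, y)
    return transformer.fit(X, y).transform(X)


X = ['a', 'a', 'b', 'c']
y = [1, 1, 1, 0]

broken = run_intermediate_step(ToyEncoder(), X, y)
patched = run_intermediate_step(PatchedToyEncoder(), X, y)

print(any(math.isnan(v) for v in broken))   # True
print(patched)                              # [1.0, 1.0, 1.0, 0.0]
```

Because the subclass replaces fit_transform itself, the pipeline picks up the patched path without any change to the pipeline definition.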
I can't reproduce this. Has something been fixed in the meantime?