
Calling .fit_transform in CatBoostEncoder doesn't work

Open ManuelZ opened this issue 5 years ago • 2 comments

I'm putting CatBoostEncoder in a pipeline just before a RandomForestClassifier, but the RF raises an error because every encoded value is NaN. If I manually call .fit(X, y) and then .transform(X), everything works. But if I call .fit_transform(X, y), the output is all NaNs.

Expected Behavior

A successfully trained RF classifier.

Actual Behavior

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Steps to Reproduce the Problem

Fully reproducible example:

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X = ['a', 'a', 'b', 'c']
y = [1, 1, 1, 0]

cb = ce.CatBoostEncoder()
rf = RandomForestClassifier()

print('Using .fit() and then .transform()')
cb.fit(X, y)
X_tr = cb.transform(X)
rf.fit(X_tr, y)
print('Success!\n')

print('Using .fit_transform()')
rf_pipeline = Pipeline(steps=[
    ('preprocesser', cb),
    ('classifier', rf)
])
rf_pipeline.fit(X,y)

Output:

Using .fit() and then .transform()
Success!

Using .fit_transform()
Traceback (most recent call last):
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Specifications

  • Version: 2.2.2
  • Platform: Windows-10-10.0.18362-SP0
  • Subsystem: Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]

ManuelZ, Aug 31 '20 01:08

For anyone else hitting this, a quick fix in your own code:

import category_encoders as ce

class CatBoostEnc(ce.CatBoostEncoder):
    # Fit first, then transform without y, instead of the built-in fit_transform.
    def fit_transform(self, X, y):
        return super().fit(X, y).transform(X)

cbe = CatBoostEnc()

There you have your encoder with a working .fit_transform().
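One caveat, as far as I understand the encoder: the built-in .fit_transform(X, y) passes y through to the transform step, so each training row is encoded only from the targets of earlier rows in the same category, while .transform(X) without y uses the statistics of the whole training set. The subclass above therefore gives different training-set encodings than a working .fit_transform() would. The sketch below is plain Python, not the library's code; the function names and the smoothing parameter a=1.0 are assumptions chosen for illustration.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of EARLIER rows in its category only."""
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, target in zip(categories, targets):
        # Use the running statistics before this row's own target is added.
        encoded.append((sums[cat] + prior * a) / (counts[cat] + a))
        sums[cat] += target
        counts[cat] += 1
    return encoded


def full_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of ALL rows in its category."""
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    return [(sums[cat] + prior * a) / (counts[cat] + a) for cat in categories]


X = ['a', 'a', 'b', 'c']
y = [1, 1, 1, 0]

ordered = ordered_target_encode(X, y)  # what fit_transform(X, y) is meant to mimic
full = full_target_encode(X, y)        # what fit(X, y).transform(X) is meant to mimic
print(ordered)
print(full)
```

On the data from the issue the two encodings differ for every row, so the workaround is fine for getting the pipeline to run but may let more target information leak into the training features than the built-in method intends.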

ManuelZ, Aug 31 '20 01:08

I can't reproduce this. Has something been fixed in the meantime?

bmreiniger, Oct 30 '20 14:10