category_encoders Catboost fit_transform method is broken.

TLDR

When fit_transform is called, the output appears shuffled and out of sync with the input. This does not occur when transform() is called after fitting. This behavior should either be explained in the documentation or fixed so that it gives the expected (i.e. ordered) results, with the values in the same order as the categories.

Example

import pandas as pd
import category_encoders as ce
cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats)

gives

cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 
                     'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 
                     'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit_transform(cats, y, return_df=True)

gives (NOTICE DIFFERENT VALUES FOR TWO a's):

Apr 28 '22 00:04 PraveshKoirala

Hi @PraveshKoirala

This is not a bug. fit_transform calls transform(X, y) with the target information. As stated in the documentation of the CatBoost encoder's transform method:

y : array-like, shape = [n_samples] when transform by leave one out
    None, when transform without target information (such as transform test set)

When y is passed, the encoding of each row always leaves out that row's own target value. Hence we expect to see some differences. Indeed,

cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats, y)

gives the same result as fit_transform. Does this make sense to you?
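To make the difference between the two calls concrete, here is a minimal pure-Python sketch of the idea. It is not the library's code: the function names, the smoothing parameter `a`, and the use of the global target mean as the prior are assumptions made for illustration. Without a target, every row of a category gets the same fitted statistic; with a target, each row only sees the targets of earlier rows of its category, so the two 'a' rows end up with different values.

```python
from collections import defaultdict

def fit_stats(cats, ys):
    """Collect per-category target sum and count, plus the global target mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    prior = sum(ys) / len(ys)
    return dict(sums), dict(counts), prior

def encode_without_target(cats, stats, a=1.0):
    """Like transform(X): use the full fitted statistics of each category."""
    sums, counts, prior = stats
    return [(sums.get(c, 0.0) + prior * a) / (counts.get(c, 0) + a) for c in cats]

def encode_with_target(cats, ys, prior, a=1.0):
    """Like transform(X, y): use only the targets of EARLIER rows of the category."""
    run_sum, run_count, out = defaultdict(float), defaultdict(int), []
    for c, y in zip(cats, ys):
        out.append((run_sum[c] + prior * a) / (run_count[c] + a))
        run_sum[c] += y      # the current target becomes visible to later rows only
        run_count[c] += 1
    return out

X = ['a', 'a', 'd', 'd', 'd', 'f', 'f']
y = [0, 0, 1, 0, 1, 1, 1]

stats = fit_stats(X, y)
without_y = encode_without_target(X, stats)
with_y = encode_with_target(X, y, prior=stats[2])

print(without_y[:2])  # both 'a' rows share one value
print(with_y[:2])     # the two 'a' rows differ
```

In this sketch the first 'a' row has no earlier rows to draw on and falls back to the prior, while the second 'a' row already sees one earlier target, which is why their encodings differ.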

May 02 '22 09:05 PaulWestenthanner

I'm not quite sure, though, why our implementation uses cumsum and cumcount. With these, the output depends on the ordering of the input. I'm not deeply familiar with the CatBoost algorithm, but I know that our implementation differs at some points from the CatBoost paper (and the "official" Yandex implementation). Feel free to dig into it if you have time. These should be the relevant lines: https://github.com/scikit-learn-contrib/category_encoders/blob/12e20486f4422a56c802a0e04163a896271d4107/category_encoders/cat_boost.py#L269-L280

May 02 '22 09:05 PaulWestenthanner

@PaulWestenthanner your question is connected to #337. cumsum and cumcount introduce a dependence on row order. That is why the existing category_encoders implementation of CatBoostEncoder is a time-aware implementation, so the data should be sorted by its datetime column. I guess this fact should be mentioned in the docs. If the data has no time column or is not time-sorted, it should still work fine (as written in the comments in the code). The earlier implementation of CatBoostEncoder used a LOO scheme to permute the data, so it did not depend on sorting (time-unaware, or has_time=False in CatBoost).
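The order dependence can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the current or the earlier library code: `ordered_encode` mimics a running-sum and running-count scheme, `leave_one_out_encode` is a generic leave-one-out statistic, and the smoothing parameter `a` and the global-mean prior are hypothetical choices. Encoding the same rows in reversed order changes the ordered result but leaves the leave-one-out result untouched.

```python
from collections import defaultdict

def ordered_encode(cats, ys, a=1.0):
    """Time-aware style: each row uses the targets of earlier rows of its category."""
    prior = sum(ys) / len(ys)
    sums, counts, out = defaultdict(float), defaultdict(int), []
    for c, y in zip(cats, ys):
        out.append((sums[c] + prior * a) / (counts[c] + a))
        sums[c] += y
        counts[c] += 1
    return out

def leave_one_out_encode(cats, ys, a=1.0):
    """Time-unaware style: each row uses all OTHER rows of its category."""
    prior = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return [(sums[c] - y + prior * a) / (counts[c] - 1 + a)
            for c, y in zip(cats, ys)]

cats = ['a', 'a', 'd', 'd', 'd', 'f', 'f']
ys = [0, 0, 1, 0, 1, 1, 1]

# Encode the reversed data, then flip the result back so row i lines up with row i.
ordered_fwd = ordered_encode(cats, ys)
ordered_rev = ordered_encode(cats[::-1], ys[::-1])[::-1]
loo_fwd = leave_one_out_encode(cats, ys)
loo_rev = leave_one_out_encode(cats[::-1], ys[::-1])[::-1]

print("ordered scheme unchanged by row order:", ordered_fwd == ordered_rev)
print("leave-one-out unchanged by row order: ", loo_fwd == loo_rev)
```

Under these assumptions, the ordered scheme only makes sense when the row order carries meaning (for example, time), which matches the point about sorting by the datetime column.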

May 03 '22 13:05 glevv

I think we should update CatBoost documentation with this and #337 taken into account

Oct 29 '22 19:10 glevv
