category_encoders
Catboost fit_transform method is broken.
TLDR
When fit_transform is called, the output looks shuffled and is not in sync with the input: rows with the same category get different values. This does not happen when transform() is called after fitting. This behaviour should either be explained in the documentation or be fixed so that it gives the expected (i.e. ordered) results, with the encoded values in the same order as the categories.
Example
import pandas as pd
import category_encoders as ce
cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column used for fitting
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats)
gives
cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'],
'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'],
'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column used for fitting
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit_transform(cats, y, return_df=True)
gives (NOTICE DIFFERENT VALUES FOR TWO a's):
Hi @PraveshKoirala,
this is not a bug. fit_transform calls transform(X, y) with the target information. As stated in the CatBoost transform documentation:
y : array-like, shape = [n_samples] when transform by leave one out
    None, when transform without target information (such as transform test set)
When the target is passed, the current row's own target value is always left out of its encoding. Hence we expect to see some differences. Indeed,
cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # target column used for fitting
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats, y)
gives the same result as fit_transform.
Does this make sense to you?
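To make the difference concrete, here is a minimal pure-Python sketch of the two code paths as I understand them. It is not the library's code: the function names encode_with_target and encode_without_target are hypothetical, and the global-mean prior and the smoothing weight a=1 are assumptions on my side.

```python
from collections import defaultdict


def encode_with_target(categories, targets, a=1.0):
    """Ordered statistic: each row only sees earlier rows of its own category."""
    prior = sum(targets) / len(targets)  # global target mean, assumed as the prior
    running_sum = defaultdict(float)     # sum of targets seen so far, per category
    running_count = defaultdict(int)     # number of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        encoded.append((running_sum[cat] + prior * a) / (running_count[cat] + a))
        running_sum[cat] += target       # the current row is counted only afterwards
        running_count[cat] += 1
    return encoded


def encode_without_target(categories, targets, a=1.0):
    """Per-category statistic over all fitted rows, as one would use for a test set."""
    prior = sum(targets) / len(targets)
    total_sum = defaultdict(float)
    total_count = defaultdict(int)
    for cat, target in zip(categories, targets):
        total_sum[cat] += target
        total_count[cat] += 1
    return [(total_sum[cat] + prior * a) / (total_count[cat] + a) for cat in categories]


X = ['a', 'a', 'd', 'd', 'd', 'f', 'f']
y = [0, 0, 1, 0, 1, 1, 1]

with_target = encode_with_target(X, y)
without_target = encode_without_target(X, y)
print(with_target[:2])     # the two 'a' rows get different values
print(without_target[:2])  # the two 'a' rows get the same value
```

In this sketch the first 'a' row has no history and falls back to the prior, while the second 'a' row already sees one earlier row, which is why the two values differ only when the target is passed.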
I'm not quite sure, though, why our implementation uses cumsum and cumcount. With these, the output depends on the ordering of the input. I'm not super deep into the CatBoost algorithm, but I know that our implementation differs at some points from the CatBoost paper (and the "official" Yandex implementation). Feel free to dig into it if you have time. These should be the relevant lines: https://github.com/scikit-learn-contrib/category_encoders/blob/12e20486f4422a56c802a0e04163a896271d4107/category_encoders/cat_boost.py#L269-L280
@PaulWestenthanner your question is connected to #337.
cumsum and cumcount introduce a dependence on sorting. That is why the existing category_encoders implementation of CatBoostEncoder is a time-aware implementation, so the data should be sorted by its datetime column. I guess this fact should be mentioned in the docs. If the data has no time column or is not time-sorted, it should still work fine (as written in the comments in the code).
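A small sketch of that ordering dependence, in plain Python rather than the library's pandas code. The helper ordered_encode is hypothetical, it emulates the per-category cumsum and cumcount with running dictionaries, and the prior and a=1 are assumptions.

```python
from collections import defaultdict


def ordered_encode(rows, a=1.0):
    """rows: list of (row_id, category, target). Returns {row_id: encoded value}."""
    prior = sum(target for _, _, target in rows) / len(rows)
    running_sum = defaultdict(float)   # plays the role of cumsum minus the current target
    running_count = defaultdict(int)   # plays the role of cumcount
    encoded = {}
    for row_id, cat, target in rows:
        encoded[row_id] = (running_sum[cat] + prior * a) / (running_count[cat] + a)
        running_sum[cat] += target
        running_count[cat] += 1
    return encoded


rows = [(0, 'd', 1), (1, 'd', 0), (2, 'd', 1)]
forward = ordered_encode(rows)         # rows in their original order
backward = ordered_encode(rows[::-1])  # same rows, reversed order
print(forward[0], backward[0])         # row 0 is encoded differently in each order
```

Row 0 is the first 'd' row in one ordering and the last in the other, so it sees a different history each time. This is the reason sorting by time before encoding matters for a time-aware scheme.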
The earlier implementation of CatBoostEncoder used a LOO (leave-one-out) scheme to permute the data, so it did not depend on sorting (time-unaware, or has_time=False in CatBoost).
I think we should update the CatBoostEncoder documentation with this and #337 taken into account.