
Example of TargetEncoder w/ categorical target

bkj opened this issue · 8 comments

Is there an example of using TargetEncoder w/ a categorical target variable? The docstring suggests that it should be possible, but I don't see how the code is determining that y is continuous vs. categorical.

Am I supposed to pass categorical y as a vector of strings? As an (n_obs, n_classes) array of one-hot encoded labels? I tried a few things, but they don't seem to work.

The code takes the mean of y -- which seems weird when y is categorical.

bkj · Apr 23 '19

All encoders should accept pd.Categorical:

y_categorical = pd.Categorical(y[0])

But TargetEncoder does not accept that, so this is a bug.


Also, I am not sure that TargetEncoder currently handles polynomial targets correctly. The way it should handle them is described in the article "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems", in the section "Extension to Multi-Valued Categorical Targets". In short, for each encoded feature it should create m-1 columns, where m is the count of unique values in the target.
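A minimal sketch of that m-1 expansion for a three-class target (plain group means stand in here for the smoothed estimates TargetEncoder would actually use):

import pandas as pd

feature = pd.Series(['a', 'a', 'b', 'b', 'c', 'c'])
target = pd.Series(['x', 'y', 'y', 'z', 'x', 'z'])  # m = 3 classes

# Drop one class so the remaining m-1 = 2 indicator columns are
# linearly independent, then compute per-category means for each.
dummies = pd.get_dummies(target, drop_first=True)
for cls in dummies.columns:
    print(cls, dummies[cls].groupby(feature).mean().to_dict())

So each encoded feature yields two columns here, one per non-reference class.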

There are four things to do:

  1. Write parameterized unit tests for binomial pd.Categorical targets. A good starting point is to clone and modify test_classification() in test_encoders.py.
  2. Write parameterized unit tests for polynomial pd.Categorical targets. It is OK to skip, in these tests, the encoders that you are not interested in and that do not work out of the box.
  3. Fix TargetEncoder.
  4. Open a pull request.

janmotl · Apr 23 '19

OK thanks. What do you mean by polynomial targets?

bkj · Apr 23 '19

In unit tests, we currently test only targets with {True, False}. That's insufficient.

Since each supervised encoder (e.g. TargetEncoder, WOEEncoder, ...) should support binary targets, we should test each encoder with the following targets:

  1. Strings like: {'Apple', 'Banana'}
  2. Integers like: {-1000, 2000}. I used "ugly" numbers since if it works for them, it should also work for "nice" numbers like {0, 1}.
  3. Booleans, just as we already do.
  4. Pandas Categoricals with two unique values.

The encoders should always encode the datasets the same way regardless of the used target representation.
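A sketch of what such an invariance test might look like; as later comments in this thread show, some of these representations currently fail outright, which is exactly what the test should expose:

import pandas as pd
from category_encoders import TargetEncoder

X = pd.DataFrame({'feature': ['a', 'b', 'a', 'b', 'a']})
bits = [0, 1, 1, 0, 1]  # the underlying binary target

# The same target in four representations.
targets = {
    'strings':  pd.Series(['Apple' if b else 'Banana' for b in bits]),
    'integers': pd.Series([2000 if b else -1000 for b in bits]),
    'booleans': pd.Series([bool(b) for b in bits]),
    'category': pd.Series(bits, dtype='category'),
}

# Desired behavior: identical encodings for every representation.
reference = TargetEncoder().fit_transform(X, targets['booleans'])
for name, y in targets.items():
    pd.testing.assert_frame_equal(TargetEncoder().fit_transform(X, y), reference)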


TargetEncoder should also be tested on polynomial targets (categorical targets with more than 2 unique values) like:

  1. Strings: {'Apple', 'Banana', 'Cinnamon'}
  2. Integers: {-1000, 2000, 4000}
  3. Pandas Categoricals with three unique values.

Finally, TargetEncoder should also be tested on continuous targets, like doubles, since it should support regression tasks (see the section "Continuous Targets" in the article).
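For the continuous case, a minimal regression check might look like the following; unlike the categorical cases above, this path already works, since y.mean() is well defined for doubles:

import numpy as np
import pandas as pd
from category_encoders import TargetEncoder

X = pd.DataFrame({'feature': ['a', 'b', 'a', 'b']})
y = pd.Series([1.5, -0.3, 2.5, 0.7])  # continuous (double) target

out = TargetEncoder().fit_transform(X, y)
assert out['feature'].dtype == np.float64  # per-category means, smoothed toward the global mean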

janmotl · Apr 23 '19

Ah ok -- I was confused by the term "polynomial". I think a more standard term is "multiclass classification" for classification w/ more than 2 unique target values. "Polynomial" makes me think of f(x) = a * x ** 2 + b * x + c...

bkj · Apr 23 '19

I observed the same lack of functionality in TargetEncoder and LeaveOneOutEncoder. Then I went through Barreca's paper to find out what was missing, and basically what it proposes is:

  1. One-hot encode the categorical target variable, leaving out one category (so the columns are linearly independent).
  2. For each new binary target, encode the categorical independent variable with the proposed technique for binary targets (i.e. several new predictors will be created instead of only one).

This is only useful when the categorical target has much lower cardinality than the independent variable you're trying to encode.
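A sketch of that recipe composed from the library's existing binary TargetEncoder; the helper name multiclass_target_encode is hypothetical, not part of the category_encoders API:

import pandas as pd
from category_encoders import TargetEncoder

def multiclass_target_encode(X, y):
    # Hypothetical helper: applies the existing binary TargetEncoder
    # once per non-reference class of a multiclass target y.
    dummies = pd.get_dummies(y, drop_first=True)  # m-1 binary targets
    parts = []
    for cls in dummies.columns:
        encoded = TargetEncoder().fit_transform(X, dummies[cls].astype(int))
        parts.append(encoded.add_suffix(f'_{cls}'))
    return pd.concat(parts, axis=1)  # (m-1) * n_features columns in total

X = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'green']})
y = pd.Series(['x', 'y', 'z', 'x', 'z', 'y'])
print(multiclass_target_encode(X, y))

The output has (m-1) * n_features columns, which is why this only pays off when the target's cardinality is low.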

A better alternative might be the one described in "Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine", available on arXiv.

willrazen · May 15 '19

The implementation of the referenced article is at: https://github.com/aslakey/CBM_Encoding

In short: instead of returning just "the average of the target", as TargetEncoder does, it also returns "the variance of the target". And that's definitely interesting.

However, when we execute https://github.com/aslakey/CBM_Encoding/blob/master/run_dirichlet_experiments.py on the car data set, which has a label with 4 classes and 6 nominal features, and calculate both avg() and var(), we end up with 48 features (4 * 6 * 2 = 48). Hence, I don't think it really solves the issue with a high-cardinality dependent variable...

janmotl · May 15 '19

It seems to be broken with any categorical target (even if it is binary):

from category_encoders import HashingEncoder, TargetEncoder
import numpy as np
import pandas as pd

if __name__ == '__main__':
    enc = TargetEncoder
    x = np.random.randint(low=0, high=5, size=(150, 4))  # 4 random features
    y = np.random.randint(low=0, high=2, size=(150,))    # binary target

    x_cat = pd.DataFrame(x)
    for col in x_cat.columns:
        x_cat[col] = x_cat[col].astype('category')  # features as 'category'
    y_cat = pd.Series(y, dtype='category')          # target as 'category'
    enc().fit(x_cat, y_cat)                         # raises TypeError (traceback below)

produces

Traceback (most recent call last):
  File ".../mwece.py", line 14, in <module>
    enc().fit(x_cat, y_cat)
  File "...\venv\lib\site-packages\category_encoders\target_encoder.py", line 142, in fit
    self.mapping = self.fit_target_encoding(X_ordinal, y)
  File "...\lib\site-packages\category_encoders\target_encoder.py", line 168, in fit_target_encoding
    prior = self._mean = y.mean()
  File "...\lib\site-packages\pandas\core\generic.py", line 11214, in stat_func
    return self._reduce(
  File "...\venv\lib\site-packages\pandas\core\series.py", line 3872, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "...\venv\lib\site-packages\pandas\core\arrays\categorical.py", line 2124, in _reduce
    raise TypeError(f"Categorical cannot perform the operation {name}")
TypeError: Categorical cannot perform the operation mean
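A possible workaround until this is fixed (my own suggestion, not an official one): pass the categorical's integer codes instead, since those do support mean():

from category_encoders import TargetEncoder
import numpy as np
import pandas as pd

x_cat = pd.DataFrame(np.random.randint(0, 5, size=(150, 4))).astype('category')
y_cat = pd.Series(np.random.randint(0, 2, size=(150,)), dtype='category')

# .cat.codes exposes the underlying integer codes (0/1 here),
# so the y.mean() call inside TargetEncoder.fit() succeeds.
TargetEncoder().fit(x_cat, y_cat.cat.codes)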

PGijsbers · Jan 12 '21

Is anyone working on this currently? Is there a difference between this TargetEncoder and the SKLearn LabelEncoder or does this library implement LabelEncoder elsewhere?

Tangentially, are the unit tests set up here to ensure that the encoders yield the same results as their SKLearn counterparts, since this library touts full compatibility with sklearn pipelines? This would be particularly important for those wanting to use this library as a drop-in replacement within their SKLearn workflows.

Shellcat-Zero · Nov 09 '21