category_encoders

Target Encoder Giving Nan values

Open shauryauppal opened this issue 3 years ago • 4 comments

Expected Behavior

TargetEncoder returns NaN values for a few inputs.

The same issue is reported here:

  • https://stackoverflow.com/questions/68261917/why-is-target-encoder-encoding-some-values-as-nan

  • https://www.kaggle.com/questions-and-answers/204970

    • Version: latest pip release as of February 2022
    • Platform: Linux
    • Subsystem: N/A

shauryauppal, Feb 16 '22 10:02

Hi @shauryauppal, could you please also provide a dataset or, even better, a self-contained minimal reproducible example? Neither the Stack Overflow post nor the Kaggle post mentions the dataset (except for a reference to the Kaggle housing prices competition, which I can't seem to find).

PaulWestenthanner, Mar 05 '22 08:03

Maybe the sigmoid function is numerically unstable in some cases, and using something like scipy.special.expit((stats['count'] - self.min_samples_leaf) / self.smoothing) here https://github.com/scikit-learn-contrib/category_encoders/blob/02a20aa96c5f1f234ec89a0f781980622e3b193a/category_encoders/target_encoder.py#L170 could be beneficial, both in terms of speed and stability. It would introduce a dependency on scipy, though.

But without a minimal reproducible example it's a needle in a haystack.
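
A minimal sketch of the stability point, using NumPy only. Here stable_sigmoid is a hypothetical stand-in for scipy.special.expit, built on the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)) so that no SciPy dependency is needed; the input values are illustrative assumptions. It suggests that a stable sigmoid avoids the overflow in np.exp for large arguments, but a NaN argument still comes out as NaN.

```python
import numpy as np


def naive_sigmoid(x):
    # Same form as the expression used in the encoder: 1 / (1 + exp(-x)).
    return 1 / (1 + np.exp(-x))


def stable_sigmoid(x):
    # Hypothetical stand-in for scipy.special.expit.
    # tanh saturates at -1 and 1, so nothing overflows.
    return 0.5 * (1 + np.tanh(0.5 * np.asarray(x, dtype=float)))


x = np.array([-1000.0, 0.0, 1000.0])

# np.exp(1000.0) overflows to inf; silence the warning to compare values.
with np.errstate(over="ignore"):
    naive = naive_sigmoid(x)
stable = stable_sigmoid(x)

# Both agree on finite inputs; only the warning behaviour differs.
print(naive, stable)

# A NaN argument (for example from 0 / 0) is propagated unchanged.
print(stable_sigmoid(np.nan))
```

Under these assumptions, swapping the sigmoid implementation alone would not remove a NaN that is already present in the argument.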

glevv, Mar 08 '22 10:03

@GLevV

NumPy evaluates the division of 0 by 0 as np.nan:

np.divide(0,0)

/tmp/ipykernel_2187/955440422.py:1: RuntimeWarning: invalid value encountered in true_divide
  np.divide(0,0)
nan

Thus if two conditions hold:

  1. self.min_samples_leaf == stats['count'] and
  2. self.smoothing == 0,

then the two variables smoove and smoothing in the formulas:

smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))
smoothing = prior * (1 - smoove) + stats['mean'] * smoove

would be equal to np.nan, giving the value np.nan for the category.
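
The two conditions can be checked in isolation with a short NumPy sketch. The counts, means, and prior below are made-up illustrative values, and smoothing_param is a hypothetical name standing in for self.smoothing; only the two expressions are taken from the formulas quoted above.

```python
import numpy as np

# Illustrative values (assumptions, not taken from the issue).
counts = np.array([2, 5, 1])        # stats['count'] per category
means = np.array([0.5, 0.8, 1.0])   # stats['mean'] per category
prior = 0.6                         # global target mean
min_samples_leaf = 2
smoothing_param = 0.0               # stands in for self.smoothing

# Same expressions as in the quoted formulas; warnings are silenced
# so the resulting values can be inspected directly.
with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    smoove = 1 / (1 + np.exp(-(counts - min_samples_leaf) / smoothing_param))
    encoded = prior * (1 - smoove) + means * smoove

print(smoove)   # first entry is nan: (2 - 2) / 0 is 0 / 0
print(encoded)  # NaN only for the category whose count equals min_samples_leaf
```

In this sketch only the first category, whose count equals min_samples_leaf, ends up as NaN. For the other two the argument is infinite, so smoove saturates to 1 or 0 and the encoded value falls back to the category mean or the prior.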

Note: We also need to take into account that the current implementation has a bizarre line, smoothing[stats['count'] == 1] = prior, which prevents a NaN value from appearing for a category that occurs in only a single row.

Example:

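import pandas as pd
from category_encoders.target_encoder import TargetEncoder

# Category 'a' occurs twice, so stats['count'] == min_samples_leaf == 2
X = pd.DataFrame({'A': ['a', 'a']})
y = pd.Series([0, 1])
# smoothing=0 turns the sigmoid argument into 0 / 0
TargetEncoder(smoothing=0, min_samples_leaf=2).fit_transform(X, y)
#     A
# 0   NaN
# 1   NaN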

MR0205, Jun 02 '22 11:06

Neither stats["count"] nor self.smoothing should be 0. The former cannot even be 0, while for the latter the documentation clearly states: "The value must be strictly bigger than 0." Without a reproducible example from @shauryauppal we cannot do anything here.

PaulWestenthanner, Jun 02 '22 14:06
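
Since the documented requirement is that smoothing be strictly greater than 0, one way to surface a bad value early is an explicit check before fitting. The sketch below is only an illustration: check_smoothing is a hypothetical helper and is not part of category_encoders.

```python
def check_smoothing(smoothing):
    """Hypothetical guard: reject smoothing values that are not > 0."""
    # `not smoothing > 0` also rejects NaN, since NaN > 0 is False.
    if not smoothing > 0:
        raise ValueError(
            f"smoothing must be strictly greater than 0, got {smoothing!r}"
        )
    return float(smoothing)


# Usage: validate the value before passing it to the encoder.
smoothing = check_smoothing(1.0)

try:
    check_smoothing(0)
except ValueError as err:
    print(err)
```

Failing with a clear error message would make the misconfiguration visible at the call site instead of as NaN values in the encoded column.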