Contrast coding schemes for multiclass labels

Open yashjakhotiya opened this issue 6 years ago • 3 comments

How would you encode high-cardinality categorical variables when the target variable is itself a high-cardinality categorical variable? One-hot encoding is impractical because of the cardinality. Label encoding is unsuitable because the categories have no inherent order. Target encoding and contrast encoding are out because they all compute the mean of the target variable, which makes no sense for multiclass labels. How should I approach this problem?

yashjakhotiya, May 29 '19 09:05

This topic was discussed in issue #182.

If the label has few unique values (e.g., fewer than 30), TargetEncoder can be extended as discussed in #182, and this works well given enough training data.

If the label has truly high cardinality (e.g., more than 1000 unique values), consider using HashingEncoder.
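One common way to extend target encoding to a multiclass label is to treat each class as its own binary target and produce one encoded column per class. The sketch below illustrates that idea in plain Python; it may differ from what #182 proposes. The helper names `fit_multiclass_target_encoding` and `transform`, the `smoothing` parameter, and the toy data are all assumptions made for illustration and are not part of category_encoders.

```python
from collections import Counter, defaultdict


def fit_multiclass_target_encoding(categories, labels, smoothing=1.0):
    """Estimate a smoothed P(class | category) for every category.

    Hypothetical helper for illustration; not a category_encoders API.
    """
    classes = sorted(set(labels))
    n = len(labels)
    # Global class frequencies, used as the prior and for unseen categories.
    prior = {c: count / n for c, count in Counter(labels).items()}

    # Per-category class counts.
    counts = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        counts[cat][lab] += 1

    encoding = {}
    for cat, class_counts in counts.items():
        total = sum(class_counts.values())
        # Shrink the per-category class frequency towards the global prior.
        encoding[cat] = {
            c: (class_counts[c] + smoothing * prior[c]) / (total + smoothing)
            for c in classes
        }
    return classes, prior, encoding


def transform(categories, classes, prior, encoding):
    """Replace each category by one value per class (prior if unseen)."""
    rows = []
    for cat in categories:
        probs = encoding.get(cat, prior)
        rows.append([probs[c] for c in classes])
    return rows


# Toy data: one categorical feature and a three-class label.
cities = ["a", "a", "b", "b", "b", "c"]
labels = ["x", "y", "x", "x", "z", "y"]

classes, prior, encoding = fit_multiclass_target_encoding(cities, labels)
encoded = transform(cities + ["unseen"], classes, prior, encoding)
```

Each input column becomes as many columns as the label has classes, which is why this approach only stays practical for labels with few unique values.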

janmotl, May 29 '19 10:05

Thanks for the response.

The target label has more than 100 possible values. I don't think one-hot encoding the target variable, a one-vs-all method, or training multiple predictors is a feasible approach.

By HashingEncoder, do you mean applying it to the target variable and then using the extended TargetEncoder, or applying HashingEncoder directly to the categorical features?

yashjakhotiya, May 29 '19 11:05

Use HashingEncoder directly on the categorical features. The advantage of this encoding is that the number of generated features does not depend on the cardinality of the features, and, because it is an unsupervised method like OneHotEncoder, it does not depend on the cardinality of the label either.
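To make the fixed-width property concrete, here is a minimal plain-Python sketch of the hashing trick. It is not the category_encoders implementation: the function `hash_encode`, the choice to prefix each value with its column position, and the toy rows are assumptions for illustration. In the library itself, the output width is, to my knowledge, set through the `n_components` argument of `HashingEncoder`.

```python
import hashlib


def hash_encode(row, n_components=8):
    """Map a tuple of categorical values to a fixed-length count vector.

    Hypothetical helper for illustration; not a category_encoders API.
    """
    vector = [0] * n_components
    for column_index, value in enumerate(row):
        # Prefix with the column position so that equal strings in
        # different columns hash separately (a choice made for this sketch).
        key = f"{column_index}={value}".encode("utf-8")
        bucket = int(hashlib.md5(key).hexdigest(), 16) % n_components
        vector[bucket] += 1
    return vector


# Toy rows with two categorical features each.
rows = [("red", "paris"), ("blue", "tokyo"), ("green", "lima")]
encoded = [hash_encode(row) for row in rows]
```

Every row maps to exactly `n_components` values however many distinct categories appear, and the label is never consulted. The trade-off is that distinct categories can collide in the same bucket, and collisions become more frequent as `n_components` shrinks relative to the number of categories.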

janmotl, May 29 '19 12:05