category_encoders
Contrast coding schemes for multiclass labels
How would you encode high-cardinality categorical variables when the target variable is itself a high-cardinality categorical variable? I cannot use OneHotEncoding because of the high cardinality. I cannot use LabelEncoding because there is no inherent order. I cannot use Target Encoding / Contrast Encoding because they all compute the mean of the target variable, which makes no sense for multiclass labels. How should I approach this problem?
This topic was discussed in issue #182.
If the label has a low count of unique values (e.g., fewer than 30), it is possible to extend TargetEncoder (as discussed in #182), and it will work well given enough training data.
If the label has truly high cardinality (e.g., more than 1000 unique values), consider using HashingEncoder.
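To make the low-cardinality case concrete, here is a minimal pure-Python sketch of one common way to extend target encoding to multiclass labels: encode each category by the fraction of its training rows that fall into each class. This is an illustration under my own assumptions and not necessarily the approach from #182 or the library's implementation. The helper names `fit_multiclass_target_encoding` and `transform` are hypothetical, and smoothing is left out for brevity.

```python
from collections import Counter, defaultdict


def fit_multiclass_target_encoding(categories, labels):
    """Estimate, per category, the share of rows belonging to each class.

    Hypothetical helper: no smoothing or regularisation is applied.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        counts[cat][lab] += 1

    # Overall class shares, used as a fallback for unseen categories.
    overall = Counter(labels)
    prior = [overall[c] / len(labels) for c in classes]

    table = {}
    for cat, ctr in counts.items():
        total = sum(ctr.values())
        table[cat] = [ctr[c] / total for c in classes]
    return classes, table, prior


def transform(categories, table, prior):
    """Replace each category with its class-share vector."""
    return [table.get(cat, prior) for cat in categories]


# Toy data, made up for illustration.
cities = ["paris", "paris", "rome", "rome", "rome", "oslo"]
labels = ["a", "b", "a", "a", "c", "b"]

classes, table, prior = fit_multiclass_target_encoding(cities, labels)
encoded = transform(["paris", "rome", "lima"], table, prior)
```

Each encoded feature yields one output column per class, which is why this only stays manageable when the label has few unique values.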
Thanks for the response.
The target label has more than 100 possible values. I don't think applying OneHotEncoder to the target variable, using a one-vs-all method, or training multiple predictors is a feasible approach.
By HashingEncoder, do you mean applying it to the target variable and then using the extended TargetEncoder, or applying HashingEncoder directly to the categorical features?
Use HashingEncoder directly on the categorical features. The advantage of this encoding is that the number of generated features depends on neither the cardinality of the features nor the cardinality of the label (it is an unsupervised method, just like OneHotEncoder).
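For intuition on why the output width stays fixed, here is a minimal pure-Python sketch of the hashing trick. It is not the library's implementation, and `hash_encode` is a hypothetical helper. In category_encoders the output width is controlled by the `n_components` parameter of `HashingEncoder`.

```python
import hashlib


def hash_encode(value, n_components=8):
    """Map one categorical value to a fixed-length indicator vector.

    Hypothetical helper illustrating the hashing trick.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components  # bucket chosen by the hash
    row = [0] * n_components
    row[index] = 1
    return row


# 1000 distinct values (made up) still produce only 8 columns.
values = ["user_%d" % i for i in range(1000)]
encoded = [hash_encode(v) for v in values]
```

The label is never consulted, so the width is the same however many classes the target has. The trade-off is that distinct categories can collide in the same bucket, and a larger `n_components` reduces such collisions at the cost of more columns.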