[IDEA] Chapter 8 - Reduce dimensionality for datasets with categorical features

Open marcio1191 opened this issue 4 years ago • 5 comments

Hello, The book doesn't cover how to handle dimensionality reduction for datasets that also contain categorical features. How would you handle these situations? Thank you in advance. Regards

marcio1191 avatar Mar 24 '21 20:03 marcio1191

Hi @marcio1191, That's an interesting question! You could simply one-hot encode the categorical features and apply the dimensionality reduction algorithm after that. If you're training a neural network, you could apply the dimensionality reduction algorithm on the dataset excluding the categorical features, then add the categorical features as trainable embeddings. Hope this helps.
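As a rough illustration of the first suggestion (one-hot encoding followed by dimensionality reduction), here is a minimal sketch using only NumPy. The toy data, the column meanings, and the choice of two components are assumptions made for the example; PCA is computed directly from an SVD of the centered data.

```python
import numpy as np

# Hypothetical toy data: two numerical columns and one categorical column.
X_num = np.array([[25.0, 40_000.0],
                  [32.0, 52_000.0],
                  [47.0, 88_000.0],
                  [51.0, 61_000.0],
                  [38.0, 45_000.0],
                  [29.0, 73_000.0]])
city = np.array(["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"])

# One-hot encode the categorical column: one binary column per category.
categories, codes = np.unique(city, return_inverse=True)
X_cat = np.eye(len(categories))[codes]

# Standardize the numerical columns so no feature dominates by scale alone.
X_num_scaled = (X_num - X_num.mean(axis=0)) / X_num.std(axis=0)

# Concatenate, center, and project onto the top 2 principal components.
X = np.hstack([X_num_scaled, X_cat])
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T

print(X_reduced.shape)  # (6, 2)
```

In a scikit-learn workflow, the same steps would typically be expressed with `OneHotEncoder`, `StandardScaler`, and `PCA`.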

ageron avatar Mar 24 '21 21:03 ageron

Thanks @ageron for answering so quickly. The problem with one-hot encoding is that, for visualization and dimensionality reduction techniques such as t-SNE and PCA, the encoded columns carry no meaningful notion of distance or variance. The algorithm will run, but the resulting features have no meaningful interpretation. https://stackoverflow.com/questions/40795141/pca-for-categorical-features
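One way to see the concern is that every pair of distinct one-hot vectors is exactly the same Euclidean distance apart, so the encoding cannot express that some categories are more alike than others. A small sketch with made-up category names:

```python
import numpy as np

# Hypothetical categories, one-hot encoded as the rows of an identity matrix.
categories = ["red", "green", "blue", "yellow"]
one_hot = np.eye(len(categories))

# Pairwise Euclidean distances between all encoded categories.
diff = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(distances, 3))
```

Every off-diagonal entry is √2 ≈ 1.414, whatever the categories mean.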

marcio1191 avatar Mar 25 '21 07:03 marcio1191

Thanks @marcio1191, it seems I answered too quickly! 😅 You're right: if you choose the first option (one-hot encoding followed by dim reduction), you have to be careful to use a dimensionality reduction algorithm that is compatible with binary values (PCA definitely isn't). I haven't looked into this question very closely, but it seems that Multiple Correspondence Analysis (MCA) should work.
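For reference, the technique usually called Multiple Correspondence Analysis (MCA) can be sketched as a correspondence analysis of the one-hot indicator matrix. The NumPy code below is a bare-bones illustration on made-up data, not a substitute for a dedicated library, which may apply further corrections to the eigenvalues.

```python
import numpy as np

# Hypothetical toy data: 3 categorical features for 6 samples.
data = np.array([
    ["red",   "small", "yes"],
    ["red",   "large", "no"],
    ["blue",  "small", "yes"],
    ["blue",  "large", "no"],
    ["green", "small", "no"],
    ["green", "large", "yes"],
])

# Indicator (one-hot) matrix Z: one block of binary columns per feature.
blocks = []
for j in range(data.shape[1]):
    cats, codes = np.unique(data[:, j], return_inverse=True)
    blocks.append(np.eye(len(cats))[codes])
Z = np.hstack(blocks)

# Correspondence analysis of Z.
P = Z / Z.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Row principal coordinates: a low-dimensional numeric view of the samples.
n_components = 2
row_coords = (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]

print(row_coords.shape)  # (6, 2)
```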

A couple of other approaches you could try:

  • Of course you can try replacing the categorical feature with meaningful data before dim reduction. For example, suppose you're trying to predict a person's life satisfaction, and the training set contains a "city" categorical feature, then you could replace this "city" feature with one or more numerical features such as the city's mean income, crime rate, average rainfall per year, mean commute time, and other things about the city that may affect the life satisfaction of its inhabitants.
  • If you have access to pretrained embeddings (or if you can generate them using a separate neural net trained on related data), then you could replace the categorical features with the pretrained embeddings and then apply dim reduction.
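The first suggestion above, replacing a "city" label with numerical descriptors, could look roughly like this. The city names, the statistics, and the `city_stats` lookup table are all hypothetical and made up for illustration; in practice the values would come from an external data source.

```python
import numpy as np

# Hypothetical lookup table; every number below is made up for illustration.
# Columns: mean income, crime rate per 1,000 inhabitants, mean commute (minutes).
city_stats = {
    "Springfield": [52_000.0, 31.0, 24.0],
    "Shelbyville": [47_000.0, 38.0, 19.0],
    "Ogdenville":  [61_000.0, 22.0, 35.0],
}

cities = ["Springfield", "Ogdenville", "Springfield", "Shelbyville"]
ages = np.array([34.0, 51.0, 27.0, 45.0])  # an existing numerical feature

# Replace each city label with its row of numerical descriptors.
city_features = np.array([city_stats[name] for name in cities])
X = np.column_stack([ages, city_features])

print(X.shape)  # (4, 4)
```

The resulting matrix is fully numerical, so it can be scaled and passed to a dimensionality reduction algorithm like any other dataset.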

Hope this helps.

ageron avatar Mar 25 '21 19:03 ageron

@ageron, Thank you for your help. Regards Marcio Fernandes

marcio1191 avatar Apr 10 '21 08:04 marcio1191

Hello, In this case, encoding methods such as target encoding or CatBoost encoding may help. Multiple category encoders are available at http://contrib.scikit-learn.org/category_encoders/. Kind regards, Guillermo Fonseca
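To give an idea of what target encoding does, here is a hand-rolled sketch in NumPy. The `target_encode` helper, its additive smoothing formula, and the toy data are assumptions for illustration; they are one common variant and not necessarily the formula used by the category_encoders library.

```python
import numpy as np

# Hypothetical toy data: a categorical feature and a numerical target.
city = np.array(["A", "A", "B", "B", "B", "C"])
y = np.array([1.0, 3.0, 10.0, 12.0, 14.0, 5.0])

def target_encode(categories, target, smoothing=1.0):
    """Replace each category with a smoothed mean of the target for that category."""
    global_mean = target.mean()
    encoded = np.empty(len(categories))
    for cat in np.unique(categories):
        mask = categories == cat
        n = mask.sum()
        # Shrink the per-category mean toward the global mean;
        # categories with few samples are shrunk more strongly.
        encoded[mask] = (target[mask].sum() + smoothing * global_mean) / (n + smoothing)
    return encoded

city_encoded = target_encode(city, y)
print(city_encoded)
```

Because the encoding is computed from the target, it should be fit on the training set only and then applied to validation and test data, otherwise target information leaks into the features.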

memo26167 avatar Dec 12 '21 18:12 memo26167