
[FEATURE] Categorical Feature Embedding via MLP

Pacman1984 opened this issue 3 years ago • 1 comment

Using an embedding layer of a deep neural network to encode categorical variables is a common practice, also known as "entity embedding". One problem is that you have to install a deep learning library like PyTorch or TensorFlow to implement a neural network with an embedding layer, and these are heavy libraries.

One alternative approach could be to use the MLPClassifier or MLPRegressor class from sklearn.neural_network to build a neural network of your choice and use the activations of the last hidden layer, right before the output layer, to encode the categorical variables. One could encode each variable on its own, or feed several variables at once so that the network can pick up interactions between them. A rough sketch of the idea is shown below.
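A minimal sketch of the idea, not the exact mlpencoder implementation: the column names, layer sizes, and the `hidden_representation` helper below are illustrative assumptions. It one-hot encodes a single categorical column, fits an `MLPClassifier`, and then recomputes the last hidden layer activations from the fitted `coefs_` and `intercepts_` to use as the embedding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier

# Toy data: one categorical column and a binary target (illustrative only).
df = pd.DataFrame({
    "city": ["London", "Paris", "Berlin", "London", "Paris", "Berlin"] * 20,
    "target": [1, 0, 1, 0, 1, 0] * 20,
})

# One-hot encode the categorical column as the network input.
ohe = OneHotEncoder()
X = ohe.fit_transform(df[["city"]]).toarray()
y = df["target"].values

# Small MLP; the last hidden layer (size 3) defines the embedding dimension.
mlp = MLPClassifier(hidden_layer_sizes=(8, 3), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(X, y)

def hidden_representation(mlp, X):
    """Forward pass through the hidden layers only (ReLU), skipping the output layer."""
    h = X
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)
    return h

# Each category is now represented by a dense 3-dimensional vector.
embeddings = hidden_representation(mlp, X)  # shape: (n_samples, 3)
```

Feeding several one-hot encoded variables into the same network instead of one column at a time would let the hidden layers capture interactions between them, at the cost of embeddings that are no longer tied to a single variable.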

I have already coded this feature; you can take a look at mlpencoder. I would implement this feature in Feature-engine if you agree.

Examples: https://github.com/Pacman1984/mlpencoder/blob/master/mlpencoder/test.ipynb

Pacman1984 avatar May 02 '22 11:05 Pacman1984

Thanks for the suggestion.

I've heard of this before, but I have to say I am not deeply familiar with these types of embeddings and when we could/should use them.

If you'd like to draft a PR, go for it. Meanwhile, I'll try to get up to speed with papers on the topic.

What would be the advantage of using embeddings for categorical variables? We won't be able to interpret the features after that, right?

One important thing: we would need to create a lot of documentation to guide users to use these embeddings correctly.

solegalli avatar May 05 '22 06:05 solegalli