feature_engine
One-hot encoder using a criterion other than frequency to select top categories
I have not checked the literature on this, but in one project I wanted to use OHE and ordinal encoding on the top categories, selected not by frequency but by another criterion: in that case, the categories with the best sales. For that specific problem, the most relevant categories were not necessarily the most frequent ones. If you think about it, sometimes you want to select the top categories of a long-tailed variable according to another variable in the dataset. These encoders could include a parameter to select the top categories based on another variable in the dataset, or something similar. What do you think?
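To make the idea concrete, here is a minimal pandas sketch. The helper name `top_categories_by`, the toy data, and the choice of summing the reference variable are all assumptions for illustration, not an existing feature_engine API.

```python
import pandas as pd


def top_categories_by(df, variable, reference, top_n=2, agg="sum"):
    """Return the top_n categories of `variable`, ranked by an aggregate of `reference`."""
    ranking = df.groupby(variable)[reference].agg(agg).sort_values(ascending=False)
    return ranking.head(top_n).index.tolist()


# Toy data: "a" is the most frequent product, but "b" and "c" sell the most.
df = pd.DataFrame({
    "product": ["a", "a", "a", "b", "c", "c", "d"],
    "sales": [1, 1, 1, 50, 20, 25, 2],
})

top = top_categories_by(df, "product", "sales", top_n=2)

# One binary column per selected category; all other categories stay all zeros.
encoded = pd.DataFrame(
    {f"product_{cat}": (df["product"] == cat).astype(int) for cat in top}
)
```

Here a frequency-based selection would pick "a" and "c", while ranking by total sales picks "b" and "c".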
Hi @indymnv
I haven't used this before, but it sounds sensible. I would not add a parameter to the existing classes, though. Instead, I would create a new class that can do either OHE or ordinal encoding, selecting the top categories based on values from another reference variable.
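A rough sketch of what such a class could look like, just to anchor the discussion. The class name, its parameters, and the choice of summing the reference variable are hypothetical; this is not an existing feature_engine class, and it skips the input checks a real transformer would need.

```python
import pandas as pd


class TopCategoriesByReferenceEncoder:
    """Hypothetical encoder: keeps the top_n categories ranked by a reference variable."""

    def __init__(self, variable, reference, top_n=2, method="onehot"):
        if method not in ("onehot", "ordinal"):
            raise ValueError("method must be 'onehot' or 'ordinal'")
        self.variable = variable
        self.reference = reference
        self.top_n = top_n
        self.method = method

    def fit(self, X):
        # Rank categories by the total of the reference variable (assumed aggregate).
        totals = X.groupby(self.variable)[self.reference].sum()
        ranked = totals.sort_values(ascending=False)
        self.top_categories_ = ranked.head(self.top_n).index.tolist()
        return self

    def transform(self, X):
        X = X.copy()
        if self.method == "ordinal":
            # Rank 1 is the best category; categories outside the top get 0.
            mapping = {cat: rank for rank, cat in enumerate(self.top_categories_, start=1)}
            X[self.variable] = X[self.variable].map(mapping).fillna(0).astype(int)
            return X
        # One binary column per top category, then drop the original column.
        for cat in self.top_categories_:
            X[f"{self.variable}_{cat}"] = (X[self.variable] == cat).astype(int)
        return X.drop(columns=[self.variable])


df = pd.DataFrame({
    "product": ["a", "a", "a", "b", "c", "c", "d"],
    "sales": [1, 1, 1, 50, 20, 25, 2],
})

ordinal = TopCategoriesByReferenceEncoder("product", "sales", method="ordinal").fit(df).transform(df)
onehot = TopCategoriesByReferenceEncoder("product", "sales", method="onehot").fit(df).transform(df)
```

Keeping this in a separate class leaves the existing encoders' behaviour and signatures untouched.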
If happy with that, feel free to give it a go.
Cheers
Hi @solegalli
Well, the idea came out of a specific problem I was trying to solve. I will look into this further, or wait to hear other people's opinions, before building a function or class that may be too specific. Let me know if you have another idea. I am going to pick up another issue from the list.
Kind regards.
Sounds good. Let's see if there is interest in this issue. We can always come back to it.