feature_engine

Encoding ordinal variables

Open david-cortes opened this issue 2 years ago • 7 comments

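Oftentimes, one wants to build linear models with ordinal variables as features (e.g. "rate on a scale from 1 to 5 ..."). One might treat these as numerical or categorical, but either choice loses some information: treating them as numerical assumes the levels are equally spaced, and treating them as categorical discards their order.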

It would be nice to have ordinal versions of some typical categorical encoders, such as mean/frequency encoders that group rows by the condition x <= c instead of x == c.
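A minimal sketch of what the frequency variant might look like, assuming plain pandas rather than any existing feature_engine class. The function name `ordinal_frequency_mapping` is hypothetical; it maps each level c to the share of rows with x <= c (a cumulative frequency) instead of the share with x == c.

```python
import pandas as pd


def ordinal_frequency_mapping(x: pd.Series) -> pd.Series:
    """Map each level c of x to the fraction of rows with x <= c.

    Hypothetical helper for illustration; not part of feature_engine.
    """
    # Count rows per level, ordered by level.
    counts = x.value_counts().sort_index()
    # Accumulate counts over levels so that level c includes all lower levels.
    return counts.cumsum() / counts.sum()


x = pd.Series([1, 1, 2, 3, 3, 3, 4, 5])
freq_mapping = ordinal_frequency_mapping(x)
encoded = x.map(freq_mapping)
```

The highest level always maps to 1.0, since every row satisfies x <= max(x).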

david-cortes • Feb 03 '23 11:02

Hi @david-cortes

Thanks for the suggestion!

I am not sure I understand what the output of the encoder should be. Could you give us an example?

solegalli • Feb 04 '23 18:02

For example, if there is a column taking possible values [1, 2, 3] and we want an ordinal mean encoding, the mapping would be:

1 -> mean(y[x <= 1])
2 -> mean(y[x <= 2])
3 -> mean(y[x <= 3])

i.e. a mean calculated by grouping over rows that have a value <= a threshold in the column being encoded (so the calculation for a value of 2 would also involve rows with a value of 1), instead of a mean calculated by grouping over each value separately.
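The mapping above could be sketched as follows, assuming plain pandas. The function name `ordinal_mean_mapping` and the toy data are made up for illustration and are not part of feature_engine.

```python
import pandas as pd


def ordinal_mean_mapping(x: pd.Series, y: pd.Series) -> pd.Series:
    """Map each level c of x to mean(y[x <= c]).

    Hypothetical helper for illustration; not part of feature_engine.
    """
    # Per-level sum and count of the target, ordered by level.
    stats = y.groupby(x).agg(["sum", "count"]).sort_index()
    # Accumulating over levels turns the x == c groups into x <= c groups.
    cumulative = stats.cumsum()
    return cumulative["sum"] / cumulative["count"]


# Toy data: a column taking values [1, 2, 3] and a numeric target.
x = pd.Series([1, 1, 2, 2, 3, 3])
y = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

mapping = ordinal_mean_mapping(x, y)
encoded = x.map(mapping)
```

On this toy data the per-value means would be 0.5, 1.0 and 0.0, while the cumulative means are 0.5, 0.75 and 0.5, because each level also pools the rows of the levels below it.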

david-cortes • Feb 05 '23 10:02

Thank you.

solegalli • Feb 06 '23 14:02

I second this. This type of encoding is especially useful for linear modeling. It has an averaging effect on ordinal variables that is much more stable than simple one-hot encoding.

@solegalli if I get a pull request together along with examples of how it is beneficial, is this something the team would consider merging?

AnotherSamWilson • Mar 12 '23 16:03

Hey @AnotherSamWilson

Thanks for joining this discussion.

Yes, we tend to be quite open towards new functionality.

I've never heard of or read about this type of encoding. Is there an article that you could link for more info? Or is this something that you do in practice, or a common practice in some industry?

To make it meaningful for potential users, we would have to add, besides the functionality, a good user guide with examples of how to use this class, and explanations about what constitutes a good use case for this type of encoding. You seem to have it covered though, because you mention examples of how this would be beneficial. So go for it!

I look forward to the PR :)

solegalli • Mar 12 '23 19:03

Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

kylegilde • Jun 16 '23 22:06

> Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

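No, because it would require overlapping groups: a row with value 1 would have to contribute to the means for 1, 2 and 3, whereas a discretiser assigns each row to exactly one bin.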
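A small sketch of the difference, using `pd.cut` plus a grouped mean as a plain-pandas stand-in for the discretise-then-mean-encode pipeline (this is an assumption for illustration and does not call the feature_engine classes themselves). The bin edges and toy data are made up.

```python
import pandas as pd

x = pd.Series([1, 1, 2, 2, 3, 3])
y = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Disjoint bins: (0, 2] and (2, 3]. Every row belongs to exactly one bin,
# so each bin mean only sees the rows inside that bin.
binned = pd.cut(x, bins=[0, 2, 3])
disjoint = y.groupby(binned, observed=True).transform("mean")

# Overlapping groups: level c pools every row with x <= c,
# so rows with value 1 are reused for levels 2 and 3.
levels = sorted(x.unique())
cumulative_mapping = {c: y[x <= c].mean() for c in levels}
cumulative = x.map(cumulative_mapping)
```

Here the disjoint version encodes level 3 as 0.0 (only its own rows), while the cumulative version encodes it as 0.5 (all rows), which no assignment of rows to non-overlapping bins reproduces together with the values for levels 1 and 2.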

david-cortes • Jun 17 '23 11:06