category_encoders

Circular categories encoding

Open DelgadoPanadero opened this issue 5 years ago • 3 comments

Hi! I came here searching for a way to encode categorical variables that have a circular distance relation (such as the days of the week, where the last day, Sunday, is very close to the first one, Monday) while preserving this characteristic.

I think that none of the encoders in this package support this behaviour. Am I right? If so, I have some ideas about how to implement it. If I develop this, would you like me to submit it as a pull request?

DelgadoPanadero, Dec 06 '19 18:12

Hi. I am curious to see what you propose.

janmotl, Dec 06 '19 19:12

I think there may be different solutions depending on the problem you are tackling. For instance, in the case of the days of the week that I mentioned before, almost everyone would use an integer variable from 1 to 7 to encode the days as numbers:

int_day(thursday) = 4

In my opinion, a better approach could be to use two variables rather than only one, which represent the x and y coordinates of a point on the unit circle as follows:

x = cos( 2pi * int_day/7)
y = sin( 2pi * int_day/7)

With this representation, the distance between any two consecutive days of the week is the same, even in the case of Sunday and Monday (last day and first day). With this "circular" representation, the transformation would be something like this:

circular_representation(thursday) = ( cos(2pi * int_day(thursday)/7), sin(2pi * int_day(thursday)/7) )
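
The transformation above can be sketched in Python as follows. This is only a minimal illustration, not part of category_encoders: the names `DAYS`, `circular_encode`, `encode_day` and `euclidean` are hypothetical, and the days are assumed to be numbered 1 (Monday) to 7 (Sunday) as in the example above.

```python
import math

# Assumed ordering: int_day(monday) = 1, ..., int_day(sunday) = 7.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def circular_encode(index, period):
    """Map an integer position on a cycle to a point (x, y) on the unit circle."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def encode_day(day):
    """Encode a day name as (cos, sin), using the 1..7 numbering above."""
    int_day = DAYS.index(day.lower()) + 1
    return circular_encode(int_day, len(DAYS))


def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Consecutive days are equally far apart, including Sunday -> Monday.
wrap_around = euclidean(encode_day("sunday"), encode_day("monday"))
ordinary = euclidean(encode_day("monday"), encode_day("tuesday"))
print(math.isclose(wrap_around, ordinary))
```

Note that only consecutive days are equidistant; days further apart on the circle are further apart in the encoded space, up to the diameter of the circle.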

I think that it only solves the Euclidean distance problem, but there are other problems that may require other representations, as well as other kinds of dependence between categorical variables, not just this "circular" case.

DelgadoPanadero, Dec 22 '19 15:12

I tested this transformation in the past with different models, and it never led to an improvement. Since then, whenever I have cyclical features and feel the need to preserve the circularity, I just use distance-based models with a distance that respects the circularity.

For a non-exhaustive critique of the sine/cosine transformation, see the comment from T. Bush at https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/.
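
As an illustration of what a distance that respects the circularity could look like, here is a minimal sketch. The function name `circular_distance` and the default `period=7` are assumptions made for this example, not something taken from this library or from the models mentioned above.

```python
def circular_distance(a, b, period=7):
    """Shortest distance between two positions on a cycle of length `period`."""
    diff = abs(a - b) % period
    # Going the other way around the cycle may be shorter.
    return min(diff, period - diff)


# With days numbered 1 (Monday) to 7 (Sunday):
print(circular_distance(7, 1))  # Sunday vs Monday
print(circular_distance(1, 4))  # Monday vs Thursday
```

A function like this could be passed to any distance-based model that accepts a user-defined metric, which keeps the original integer encoding and moves the circularity into the distance itself.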

Nevertheless, if you find (and document) at least one scenario (on some real dataset) where this transformation improves the accuracy of the model (and it's not just random fluctuation), the transformation will be a welcome extension of this library.

janmotl, Dec 23 '19 12:12