imbalanced-learn

In SMOTENC, why is the median std halved to estimate the distance of the categorical features?

Open solegalli opened this issue 2 years ago • 1 comments

In this line, when adding the median(std) to the OHE matrix to estimate the distance of categorical features, the median is divided by 2.

Is this a bug, or is it intentional? And if it is intentional, why?

Thanks a lot!

solegalli, Sep 02 '21 15:09

Hey @solegalli, it's been a while since you opened this issue, but I just replied to the other issue you opened. It's a bug in the sense that the median should be divided by 2**(1/2) instead of 2. Some scaling is needed because the categorical features are one-hot encoded: two observations with different values in a categorical feature differ in two one-hot columns, so dividing by 2**(1/2) makes the sum of the squared differences in the Euclidean distance equal to Med**2. With the current implementation, which divides by 2, that sum is Med**2 / 2, so the importance of the categorical features is halved compared to the SMOTENC implementation proposed by Chawla et al. But honestly, I'm not even sure I'm correct about this possible bug. It seems like something so simple that I'm afraid I might be saying something stupid...
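The arithmetic can be checked with a small NumPy sketch. This is only a toy illustration, not the library's code: it does not call imbalanced-learn, the value of `med` is a made-up stand-in for the median standard deviation, and `categorical_sq_distance` is a hypothetical helper introduced here.

```python
import numpy as np

# Hypothetical value standing in for the median standard deviation of the
# continuous features ("Med" in the discussion); any positive number works.
med = 3.0


def categorical_sq_distance(scale):
    """Squared Euclidean distance between two samples that differ in one
    one-hot encoded categorical feature, with the 1s replaced by `scale`."""
    a = np.array([scale, 0.0])  # sample belonging to category A
    b = np.array([0.0, scale])  # sample belonging to category B
    # The two samples differ in both one-hot columns, so two terms are summed.
    return float(np.sum((a - b) ** 2))


# Scaling by Med / 2, as described in the issue.
sq_half = categorical_sq_distance(med / 2)

# Scaling by Med / 2**(1/2), as suggested in the reply.
sq_sqrt2 = categorical_sq_distance(med / np.sqrt(2))

print("Med / 2        ->", sq_half, "(expected Med**2 / 2 =", med**2 / 2, ")")
print("Med / 2**(1/2) ->", sq_sqrt2, "(expected Med**2 =", med**2, ")")
```

Under these assumptions, the first scaling contributes Med**2 / 2 to the squared distance and the second contributes Med**2, up to floating-point rounding.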

Link to my reply on the other issue (where I believe I described this problem in a bit more detail): https://github.com/scikit-learn-contrib/imbalanced-learn/issues/860#issuecomment-1162945166

joaopfonseca, Jun 22 '22 11:06