smote_variants

Minimum number of rows in a class

Open RyanMetz opened this issue 3 years ago • 1 comment

I've been using the ADOMS implementation in this package to balance classes for a while, with great results. The other day a colleague asked me how many rows a class must have, at minimum, to be reasonably oversampled. I told him there probably isn't a magic number and that the real question is how representative of the true population the sample of instances in the class is. But the problem stuck with me, and after reading through several papers I haven't seen it addressed. Does anyone have a rough rule or guideline for the smallest number of rows a class needs for oversampling with a SMOTE variant? Thanks!

RyanMetz, Aug 27 '20 20:08

Great to hear! :) And yes, it's an important question. Most of the techniques work with as few as 2 samples from a class. However, in some cases the operating principles require more. For example, some techniques apply k-fold cross-validation to check how well the generated samples fit the original distribution. In these cases at least k rows are needed.
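To make the two limits above concrete, here is a rough sketch in plain Python. It is not how smote_variants implements anything: `classes_below_minimum` and `interpolate_samples` are hypothetical helpers written only for illustration. The first flags classes with fewer rows than some required minimum (for example k for a k-fold based technique), and the second shows why plain SMOTE-style interpolation needs at least 2 rows.

```python
import random
from collections import Counter


def classes_below_minimum(y, min_rows):
    """Return {label: count} for every class with fewer than min_rows rows."""
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < min_rows}


def interpolate_samples(minority, n_new, seed=0):
    """Create n_new points on line segments between random pairs of rows.

    This is the basic SMOTE idea without the nearest-neighbour search,
    so it is only a simplified stand-in for a real oversampler.
    """
    if len(minority) < 2:
        raise ValueError("interpolation needs at least 2 minority rows")
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct rows
        t = rng.random()                # position along the segment a -> b
        new_points.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return new_points


# 20 majority rows, 2 minority rows; 5 is an assumed requirement (e.g. k = 5).
y = [0] * 20 + [1] * 2
too_small = classes_below_minimum(y, min_rows=5)
print(too_small)  # {1: 2}

# Interpolation still runs with only 2 rows ...
minority = [[0.0, 0.0], [1.0, 2.0]]
synthetic = interpolate_samples(minority, n_new=3)
print(synthetic)
```

With only 2 rows every synthetic point falls on the single segment between them, which is the "it runs, but don't expect much" case.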

On the flip side, just as you mention, the sample should be somewhat representative of the population. So even though the oversamplers run with 2 samples, one cannot expect useful results from that. I like to think of oversampling as a highly regularized and guided form of kernel-density-estimation-based sampling, and consequently, the more samples we have, the better the results we can expect. But generally, I don't know of any guidance on the lowest number of samples, apart from the limitations implied by the operating principles of the methods.
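The kernel density analogy can be sketched like this. It is plain, unregularized Gaussian KDE sampling, not any specific oversampler from the package; `kde_sample` and the `bandwidth` value are assumptions made up for the illustration. Sampling from a Gaussian KDE amounts to picking one of the original rows at random and adding Gaussian noise to it.

```python
import random


def kde_sample(points, n_new, bandwidth, seed=0):
    """Draw n_new points from a Gaussian KDE fitted on `points`.

    Each draw picks one original row uniformly at random and perturbs
    every coordinate with N(0, bandwidth^2) noise.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_new):
        center = rng.choice(points)
        samples.append([c + rng.gauss(0.0, bandwidth) for c in center])
    return samples


# With 2 rows, every generated point is a noisy copy of one of those 2 rows,
# so the synthetic data can only be as representative as the rows themselves.
minority = [[0.0, 0.0], [1.0, 2.0]]
generated = kde_sample(minority, n_new=5, bandwidth=0.1)
print(generated)
```

The estimated density can only place mass around the rows it has seen, which is why more rows give better coverage of the true class distribution.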

gykovacs, Aug 28 '20 08:08