[BUG] NCL - a class should be cleaned if its number of samples is at least 0.5 * the number of minority samples, not 0.5 * data.shape[0]
Describe the bug
Neighbourhood cleaning rule procedure:
- Split data T into the class of interest C (minority) and the rest of data O.
- Identify noisy data A1 in O with edited nearest neighbor rule.
- For each class Ci in O (that is, for each observation in the majority class(es)): if ( x ∈ Ci in 3-nearest neighbors of misclassified y ∈ C ) and ( |Ci| ≥ 0.5 · |C| ) then A2 = { x } ∪ A2
- Reduced data S = T - ( A1 union A2 )
The above is a copy of the pseudo code in the article. There, C is the minority class or class of interest.
The article further states: "To avoid excessive reduction of small classes, only examples from classes larger or equal to 0.5 * | C | are considered while forming A2." Earlier, it defines C as the minority class, and it refers to the entire dataset as T.
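For reference, this is how I read the pseudo code, as a minimal plain-Python sketch. It is not the imbalanced-learn implementation: `ncl_sketch` and `knn_indices` are hypothetical names, the neighbour search is brute force with Euclidean distance, and ties in the 3-NN vote are broken arbitrarily.

```python
from collections import Counter
import math


def knn_indices(X, i, k=3):
    """Indices of the k nearest neighbours of X[i], excluding i itself."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]


def ncl_sketch(X, y, minority, k=3, threshold=0.5):
    """Return the indices kept in S = T - (A1 union A2)."""
    counts = Counter(y)
    n_minority = counts[minority]  # |C|
    a1, a2 = set(), set()
    for i in range(len(X)):
        neighbours = knn_indices(X, i, k)
        predicted = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if predicted == y[i]:
            continue  # correctly classified by its neighbours
        if y[i] != minority:
            # A1: noisy example from O (edited nearest neighbour rule)
            a1.add(i)
        else:
            # A2: neighbours from O of a misclassified minority example,
            # only from classes with |Ci| >= threshold * |C|
            for j in neighbours:
                if y[j] != minority and counts[y[j]] >= threshold * n_minority:
                    a2.add(j)
    removed = a1 | a2
    return [i for i in range(len(X)) if i not in removed]


# Toy 1-D data: one minority point (10) sits between two majority points.
X = [(0,), (1,), (10,), (9,), (11,), (20,), (21,), (22,), (23,)]
y = ["m", "m", "m", "a", "a", "a", "a", "a", "a"]
kept = ncl_sketch(X, y, minority="m")
print(kept)
```

On this toy data, the majority points at 9 and 11 are removed and everything else is kept.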
I renamed the issue because, after reading the paper further, I realised my original interpretation was wrong: the implementation in imbalanced-learn reflects what is proposed in the paper, apart from the criterion used to exclude observations from the cleaning procedure.
@glemaitre @chkoar was this parameter set up as n_samples > X.shape[0] * self.threshold_cleaning for some reason?
Otherwise, I am happy to pick this up. Please let me know.
n_samples > X.shape[0] * self.threshold_cleaning
It corresponds to C_i > C * t, where by default t is 0.5 as in the paper. We then added a parameter so that one has control over cleaning the other classes.
I will add some additional tests now but the algorithm looks fine to me.
Oh no, I see your point. Indeed, it should be the minority class.
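To make the difference concrete, here is a small sketch comparing the two reference sizes on toy labels. `classes_to_clean` is a hypothetical helper, not a function from imbalanced-learn; it keeps the strict `>` from the line quoted above so that only the reference size changes.

```python
from collections import Counter


def classes_to_clean(y, minority, threshold=0.5, reference="minority"):
    """List the non-minority classes that pass the size check.

    reference="dataset"  mirrors n_samples > X.shape[0] * threshold
    reference="minority" uses the minority class size instead
    """
    counts = Counter(y)
    ref_size = counts[minority] if reference == "minority" else len(y)
    return sorted(
        label
        for label, n_samples in counts.items()
        if label != minority and n_samples > ref_size * threshold
    )


# Toy labels: 10 minority samples, two larger classes of 40 and 50.
y = ["m"] * 10 + ["a"] * 40 + ["b"] * 50

print(classes_to_clean(y, "m", reference="dataset"))   # cut-off is 50 samples
print(classes_to_clean(y, "m", reference="minority"))  # cut-off is 5 samples
```

With the dataset size as reference, neither "a" nor "b" exceeds half of the 100 samples, so no class is considered for A2. With the minority size as reference, both are.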