classifier-reborn
Implement stratified k-fold cross-validation
The current k-fold cross-validation assumes that the supplied sample data is uniformly randomized, so it simply slices the array into individual folds. We should instead partition the data so that the proportion of each class is maintained in every fold. Stratification could be the default, the only partitioning option, or an opt-in behavior enabled by an optional boolean parameter.
I'm open to this, but wouldn't know how to do it.
To enforce this, we first have to bucket the supplied sample set by class and then partition each bucket into k equal parts. Finally, we pick one chunk from each bucket to build the data for each of the k folds. It is not difficult to do, and I can take care of it when I get a chance to play with the code again. For now, however, we shuffle the sample data before splitting, which in theory has a similar effect but is less precise: the class proportions in each fold depend on how the shuffle happens to fall.
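For illustration, here is a minimal sketch of the bucketing idea. It is not the library's implementation: `stratified_folds` is a hypothetical helper, and the sample data is assumed to be an array of `[label, text]` pairs. Instead of cutting each bucket into k chunks, it deals each bucket's samples round-robin across the folds, which gives the same per-class balance and also handles bucket sizes that are not divisible by k.

```ruby
# Hypothetical helper, not part of classifier-reborn's API.
# sample_data is assumed to be an array of [label, text] pairs.
def stratified_folds(sample_data, k, rng: Random.new)
  folds = Array.new(k) { [] }
  offset = 0

  # Bucket the samples by class label.
  sample_data.group_by(&:first).each_value do |bucket|
    # Shuffle within the class, then deal its samples round-robin
    # so every fold receives an (almost) equal share of this class.
    bucket.shuffle(random: rng).each_with_index do |sample, i|
      folds[(offset + i) % k] << sample
    end
    # Start the next class where this one stopped, so leftover
    # samples do not all pile up in the first folds.
    offset += bucket.size
  end

  folds
end
```

With 8 samples of one class, 4 of another, and k = 4, every fold ends up with 2 samples of the first class and 1 of the second. Each fold can then serve as the test set once while the remaining k - 1 folds are concatenated for training.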