
Fitting classifiers with bootstrapping on small datasets with few classes risks having only one class in dataset

Open. OskarLiew opened this issue 5 years ago • 1 comment

As the title says. For demonstration purposes, say I have a committee of 50 learners and a training set of 2 data points, one of class A and one of class B, and I want to fit the learners with bootstrapping (for some reason). Then I will very likely get an exception from sklearn, because at least one classifier's bootstrap sample will contain only one class.
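To put a rough number on "likely", here is a minimal sketch that assumes each bootstrap sample draws as many points as the training set, uniformly with replacement, and that the learners are resampled independently. The helper name `single_class_probability` is made up for illustration and is not part of modAL.

```python
from collections import Counter


def single_class_probability(labels):
    """Probability that a bootstrap sample of len(labels) draws, taken
    uniformly with replacement, contains only one class."""
    n = len(labels)
    counts = Counter(labels)
    if len(counts) < 2:
        # Only one class exists, so every sample is single-class.
        return 1.0
    # "All draws are class c" are mutually exclusive events, so sum them.
    return sum((c / n) ** n for c in counts.values())


labels = ["A", "B"]   # the two-point example from above
n_learners = 50       # committee size from above

p_one = single_class_probability(labels)
# Chance that at least one of the independently resampled learners is hit.
p_any = 1 - (1 - p_one) ** n_learners
print(p_one, p_any)
```

For the two-point example each learner has a one-in-two chance of seeing a single class, so with 50 learners a failure is all but certain.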

A possible fix would be to ensure that at least one sample from each class is present in the bootstrapped data.

import numpy as np

def get_bootstrap_idx(y_training):
    n_instances = y_training.shape[0]
    classes = np.unique(y_training)
    # Draw one index per class so that every class appears at least once.
    bootstrap_idx = np.array([], dtype=int)
    for y in classes:
        idx = np.where(y_training == y)[0]
        bootstrap_idx = np.append(bootstrap_idx, np.random.choice(idx, 1))
    # Fill the remaining slots with an ordinary bootstrap draw over all instances.
    remaining = np.random.choice(n_instances, n_instances - len(classes), replace=True)
    return np.append(bootstrap_idx, remaining)
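A quick way to check the idea is to draw many index sets and confirm that each one covers every class. The sketch below is self-contained, so it uses a hypothetical variant named `stratified_bootstrap_idx` that takes a seeded `numpy.random.Generator` for reproducibility; the name and the `rng` argument are assumptions for this example, not modAL API.

```python
import numpy as np


def stratified_bootstrap_idx(y, rng):
    """Hypothetical variant: one guaranteed draw per class, rest uniform."""
    y = np.asarray(y)
    n = y.shape[0]
    classes = np.unique(y)
    # One randomly chosen index from each class.
    forced = [rng.choice(np.flatnonzero(y == c)) for c in classes]
    # Remaining slots: plain bootstrap over all instances.
    rest = rng.integers(0, n, size=n - len(classes))
    return np.concatenate([np.array(forced, dtype=int), rest])


rng = np.random.default_rng(0)

# The two-point example: every one of 50 index sets must cover A and B.
y = np.array(["A", "B"])
samples = [stratified_bootstrap_idx(y, rng) for _ in range(50)]
all_ok = all(len(np.unique(y[idx])) == 2 for idx in samples)

# An imbalanced example: nine points of class 0 and one of class 1.
y_imb = np.array([0] * 9 + [1])
idx_imb = stratified_bootstrap_idx(y_imb, rng)
```

Note that the forced draws make rare classes appear more often than they would in a plain bootstrap, so the result is no longer a uniform resample of the data.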

OskarLiew, Sep 21 '20 14:09

I am not sure what a proper solution would be here. Forcing every bootstrap sample to contain at least two classes is kind of an artificial solution. I am thinking about what to do and will return soon!

cosmic-cortex, Sep 22 '20 06:09