
Issues with stratified_subsampling()

Open imenelydiaker opened this issue 10 months ago • 3 comments

I ran into the following error when using our subsampling function on the GreekLegalCodeClassification task:

... in train_test_split 
raise ValueError(
ValueError: The least populated class in label column has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

@isaac-chung it seems that train_test_split can't stratify when a class has only one sample. Any idea on how to solve this?

imenelydiaker avatar Apr 23 '24 13:04 imenelydiaker

Re: Problem 1, we might have to fall back to a plain shuffle inside a try/except:

self.dataset["test"] = (
    self.dataset["test"].shuffle(seed=self.seed).select(range(TEST_SAMPLES))
)
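A minimal sketch of what that try/except could look like, written in plain Python on index lists rather than against the datasets API. The names subsample_indices and stratified_fn are hypothetical (not part of mteb), and stratified_fn stands in for whatever stratified split is used, assuming it raises ValueError on singleton classes the way train_test_split does:

```python
import random


def subsample_indices(labels, n_samples, seed, stratified_fn):
    """Return n_samples row indices, stratified when possible.

    stratified_fn is a hypothetical callable (labels, n_samples, seed) -> indices
    that is assumed to raise ValueError when a class has too few members.
    """
    try:
        return stratified_fn(labels, n_samples, seed)
    except ValueError:
        # Fallback: seeded shuffle, which ignores class balance entirely.
        rng = random.Random(seed)
        indices = list(range(len(labels)))
        rng.shuffle(indices)
        return indices[:n_samples]
```

The fallback keeps the run reproducible through the seed, but the result is no longer stratified.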

wdyt?

isaac-chung avatar Apr 23 '24 13:04 isaac-chung

I personally find it weird to have classes with only 1 sample, so maybe we shouldn't handle them? We could filter the dataset and remove rows whose class has only 1 sample, wdyt? A plain shuffle just won't account for class imbalance, and I'm not sure it's good to use one in a function we named stratified_subsampling() 🤔
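A rough sketch of the filtering idea, in plain Python on a list of row dicts. The helper name drop_singleton_classes, the "label" key, and the min_count parameter are assumptions for illustration, not existing mteb code:

```python
from collections import Counter


def drop_singleton_classes(rows, label_key="label", min_count=2):
    """Keep only rows whose class appears at least min_count times."""
    # Count how many rows each class has.
    counts = Counter(row[label_key] for row in rows)
    # Drop rows belonging to classes that are too small to stratify.
    return [row for row in rows if counts[row[label_key]] >= min_count]
```

Running this before the stratified split would avoid the ValueError, at the cost of silently dropping the rare classes from the evaluation set.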

imenelydiaker avatar Apr 23 '24 13:04 imenelydiaker

Good point. It wouldn't be true to the name anymore. I feel that we can put a note in this method saying that it removes rows whose class has only 1 sample, so it should be used with caution.

isaac-chung avatar Apr 23 '24 13:04 isaac-chung