mteb
Issues with stratified_subsampling()
I ran into the following error when using our subsampling function on the GreekLegalCodeClassification task:
... in train_test_split
raise ValueError(
ValueError: The least populated class in label column has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
@isaac-chung it seems that train_test_split doesn't handle stratification when a class has only one sample. Any idea how to solve this?
Re: Problem 1, we might have to fall back to a plain shuffle inside a try/except:
self.dataset["test"] = (
self.dataset["test"].shuffle(seed=self.seed).select(range(TEST_SAMPLES))
)
wdyt?
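A minimal sketch of what that try/except fallback could look like, written against plain Python lists so it runs on its own. The names `subsample_with_fallback`, `stratified_fn`, and `failing_stratified` are hypothetical and not part of mteb; `failing_stratified` only stands in for the train_test_split-based path and raises the same kind of ValueError.

```python
import random


def subsample_with_fallback(rows, n_samples, stratified_fn, seed=42):
    """Try a stratified subsample; fall back to a seeded shuffle on ValueError."""
    try:
        return stratified_fn(rows, n_samples, seed)
    except ValueError:
        # The fallback ignores class balance entirely: it is a plain
        # seeded shuffle followed by taking the first n_samples rows.
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:n_samples]


def failing_stratified(rows, n_samples, seed):
    # Hypothetical stand-in for the stratified path, which raises when
    # a class has a single member.
    raise ValueError("The least populated class has only 1 member")


# Every label occurs exactly once, so stratification cannot succeed.
rows = [{"text": f"doc{i}", "label": i} for i in range(10)]
subset = subsample_with_fallback(rows, 4, failing_stratified)
```

Note that the result is reproducible for a given seed, but nothing about it is stratified once the except branch is taken.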
I personally find it weird to have classes with only 1 sample; maybe we shouldn't handle them? We could filter the dataset and remove rows whose class has only 1 sample, wdyt?
The shuffle just won't take the class imbalance into account, and I'm not sure it's good to use it in a function we named stratified_subsampling()
🤔
Good point. It wouldn't be true to the name anymore. I feel that we can put a note in this method saying that rows from classes with only 1 sample will be removed, so use with caution.
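For illustration, a rough sketch of the filtering step, assuming each row is a dict with a "label" key. The helper name `drop_singleton_classes` is made up for this example and is not an existing mteb function; on a Hugging Face dataset the same idea could presumably be expressed with `Dataset.filter`.

```python
from collections import Counter


def drop_singleton_classes(rows, label_key="label"):
    """Remove rows whose class appears fewer than twice."""
    # Count how many rows each class has.
    counts = Counter(row[label_key] for row in rows)
    # Keep only rows from classes with at least 2 members, which is
    # the minimum the ValueError above asks for.
    return [row for row in rows if counts[row[label_key]] >= 2]


rows = [
    {"text": "a", "label": 0},
    {"text": "b", "label": 0},
    {"text": "c", "label": 1},  # only member of class 1, will be dropped
    {"text": "d", "label": 2},
    {"text": "e", "label": 2},
]
filtered = drop_singleton_classes(rows)
```

The stratified split would then run on `filtered` instead of the full set, at the cost of silently losing the singleton classes, hence the note in the docstring.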