mteb Convert tasks to ClusteringFast.

There are a number of tasks in the benchmark that should be converted to ClusteringFast to gain a speedup. Additionally, many tasks that are currently formulated as flat clustering should be reformulated as hierarchical. For instance, VGClustering, SNL and ArXiv are all technically hierarchical but are currently formulated as flat clustering problems.

I opened this issue to have one place for all of the tasks that have to be converted.

Convert to fast

[ ] French tasks #568
[ ] BigPatent
[ ] Biorxiv
[ ] Medrxiv
[x] Reddit #728
[ ] StackExchange #739
[ ] TwentyNewsgroups
[ ] WikiCities
[ ] IndicReviews
[ ] MLSUM
[ ] MasakhaNews
[ ] WikiClustering #742
[ ] 8Tags
[x] ~~RomaniBible~~ (only has 2048 examples)
[ ] Flores
[ ] CLS
[ ] ThuNews

Convert to hierarchical

[x] SNL
- [x] S2S
- [x] P2P
[ ] VG #656
[ ] #696
[ ] #702

May 10 '24 11:05 x-tabdeveloping

Migration guide

Since this might be the first issue for some people, here's a little migration guide for the uninitiated:

We have a task type called AbsTaskClustering, which is quite slow.
We have a lot of old tasks with enormous test sets, probably excessively so.
We have a couple of tasks that are currently formulated as flat clustering tasks, despite the fact that they are hierarchical in nature.

To fix these issues we have created AbsTaskClusteringFast, which is faster, can do subsampling of the datasets (now in a stratified manner) and can deal with hierarchical clustering problems.

To migrate a dataset over to Fast clustering you will need to do a number of things.

0. Turn a dataset in the list into an issue and assign yourself

That way we'll know you're working on the issue :)

1. Make sure the dataset is in the right format

In older clustering tasks every row in a dataset is one clustering task with N passages and N labels. In AbsTaskClusteringFast there is no need for the dataset to contain multiple experiments as these are created on the spot by subsampling the data. The type of one entry in AbsTaskClustering looks like this:

{"sentences": ["sentence1", "sentence2", ...], "labels": ["label1", "label2", ...]}

While in AbsTaskClusteringFast a dataset entry/row only contains one example, but could contain multiple labels if the task is hierarchical:

{"sentences": "Sentence1", "labels": ["label_level1", "label_level2",...]}
{"sentences": "Sentence2", "labels": ["label_level1", "label_level2",...]}
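To make the hierarchical case a bit more concrete, here is a minimal sketch of how a dotted category string could be expanded into one label per level. The helper `to_label_levels`, the dot separator and the example categories are all assumptions for illustration, not something the existing tasks necessarily use:

```python
def to_label_levels(category: str, sep: str = ".") -> list[str]:
    """Expand a dotted category into one label per hierarchy level (hypothetical helper)."""
    parts = category.split(sep)
    # Each level keeps its full prefix so that sub-labels of different
    # parents never collide, e.g. "cs.CL" becomes ["cs", "cs.CL"].
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


# One example per row, with labels ordered from coarsest to finest level.
rows = [
    {"sentences": "A paper about parsing", "labels": to_label_levels("cs.CL")},
    {"sentences": "A paper about lenses", "labels": to_label_levels("physics.optics")},
]
```

For a flat task the `labels` field would simply hold a single label per row.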

You either need to reupload the dataset after adjustments or use dataset_transform. Here's some example code of how I did it in a PR:

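import itertools

from datasets import Dataset, DatasetDict

# Assumption: this was the import path in mid-2024; adjust if the module has moved.
from mteb.abstasks.AbsTaskClusteringFast import AbsTaskClusteringFast


class SomeFastClusteringTask(AbsTaskClusteringFast):
    ...

    def dataset_transform(self):
        ds = dict()
        for split in self.metadata.eval_splits:
            # Each old-format row holds a whole list of labels and sentences;
            # flatten them so that every row becomes a single example.
            labels = list(itertools.chain.from_iterable(self.dataset[split]["labels"]))
            sentences = list(itertools.chain.from_iterable(self.dataset[split]["sentences"]))
            ds[split] = Dataset.from_dict({"labels": labels, "sentences": sentences})
        self.dataset = DatasetDict(ds)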
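If you want to see what the flattening does without loading a real dataset, here is a toy sketch with plain Python lists. The data and the `flatten` helper are made up for illustration:

```python
import itertools

# Old format: every row is a whole clustering experiment.
old_rows = {
    "sentences": [["a1", "a2"], ["b1"]],
    "labels": [["x", "y"], ["x"]],
}


def flatten(column):
    # chain.from_iterable returns a lazy, single-pass iterator,
    # so materialize it into a list before handing it on.
    return list(itertools.chain.from_iterable(column))


# New format: one sentence and one label per row.
new_rows = {
    "sentences": flatten(old_rows["sentences"]),
    "labels": flatten(old_rows["labels"]),
}
```

After the transform both columns have the same length, and the i-th label belongs to the i-th sentence.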

2. Subsample if needed

If the dataset is too large, subsample the test set. You are free to select random examples, or use stratified subsampling (should be preferred when possible). The test set should preferably contain no more than 2048 examples.

def dataset_transform(self):
    self.dataset = self.stratified_subsampling(
        self.dataset,
        self.seed,
        self.metadata.eval_splits,
        label="labels",
        n_samples=2048,
    )
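For intuition on what "stratified" means here, below is a rough standalone sketch in plain Python. It is not how `stratified_subsampling` is implemented in mteb; the function `stratified_sample` and its parameters are hypothetical, and it assumes flat, hashable labels:

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_samples, seed=42):
    """Return at most n_samples sorted indices, roughly preserving label proportions."""
    if n_samples >= len(labels):
        return list(range(len(labels)))
    rng = random.Random(seed)
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    picked = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Proportional quota, with at least one example drawn per label.
        quota = max(1, round(n_samples * len(indices) / len(labels)))
        picked.extend(rng.sample(indices, min(quota, len(indices))))
    # Rounding can overshoot slightly; trim at random to respect the cap.
    rng.shuffle(picked)
    return sorted(picked[:n_samples])


labels = ["a"] * 80 + ["b"] * 20
subset = stratified_sample(labels, n_samples=10)
```

With an 80/20 label split and 10 samples, the subset keeps the same 80/20 ratio, whereas purely random selection could easily miss the rarer label.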

3. Fill out the metadata properly

Some old tasks do not contain enough metadata. Every new submission should have complete metadata, so please do some research on the dataset before submitting a PR.

4. S2S and P2P tasks

Some clustering tasks have shorter passages (e.g. "title") and longer ones too (e.g. "abstract"). In these cases you should add an S2S (sentence-to-sentence) task over the shorter passages, and a P2P (paragraph-to-paragraph) task over the longer passages.
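As a sketch of the idea, the same records can be turned into two views that share labels but differ in which text field they use. The field names `title`, `abstract` and `category`, and the helper `make_view`, are assumptions for illustration:

```python
# Toy records with a short and a long text field each (made-up data).
records = [
    {"title": "Parsing with trees", "abstract": "We study tree-based parsing in detail.", "category": "cs"},
    {"title": "Lens design", "abstract": "We describe a new approach to lens design.", "category": "physics"},
]


def make_view(records, text_field, label_field="category"):
    """Build the sentences/labels columns from one text field (hypothetical helper)."""
    return {
        "sentences": [r[text_field] for r in records],
        "labels": [r[label_field] for r in records],
    }


s2s = make_view(records, "title")     # shorter passages -> S2S task
p2p = make_view(records, "abstract")  # longer passages -> P2P task
```

Both views would then back separate task classes, one for S2S and one for P2P.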

5. Supersede the old tasks

Since now you have new tasks in the benchmark, we don't need the old ones anymore. However, do not delete the old tasks. Instead, add a superseeded_by attribute on the old tasks.

class SomeClusteringTask(AbsTaskClustering):
    superseeded_by = "SomeClusteringTaskFast"

6. Run the new tasks

As you will see in the PR template, you will have to run the new tasks on E5 and paraphrase models.

May 14 '24 12:05 x-tabdeveloping

Most importantly: If you have questions or need code review, make sure to mention me or text me about it. I will try to be accessible. If I'm for some reason unavailable, ping @KennethEnevoldsen

May 14 '24 12:05 x-tabdeveloping

Love this! I ran the suggested code and have one suggestion:

- labels = itertools.chain.from_iterable(self.dataset[split]["labels"])
+ labels = list(itertools.chain.from_iterable(self.dataset[split]["labels"]))

- sentences = itertools.chain.from_iterable(self.dataset[split]["sentences"])
+ sentences = list(itertools.chain.from_iterable(self.dataset[split]["sentences"]))

May 15 '24 12:05 isaac-chung

@isaac-chung oh yeah right, I forgot that sorries :D

May 15 '24 13:05 x-tabdeveloping

That's a wrap!

Jun 02 '24 14:06 isaac-chung
