mteb
Convert tasks to ClusteringFast.
Several clustering tasks in the benchmark should be converted to ClusteringFast to gain a speedup. Additionally, many tasks that are currently formulated as flat clustering should be reformulated as hierarchical. For instance, VGClustering, SNL and ArXiv are all technically hierarchical but are currently formulated as flat clustering problems.
I opened this issue to have one place for all of the tasks that have to be converted.
Convert to fast
- [ ] French tasks #568
- [ ] BigPatent
- [ ] Biorxiv
- [ ] Medrxiv
- [x] Reddit #728
- [ ] StackExchange #739
- [ ] TwentyNewsgroups
- [ ] WikiCities
- [ ] IndicReviews
- [ ] MLSUM
- [ ] MasakhaNews
- [ ] WikiClustering #742
- [ ] 8Tags
- [x] ~~RomaniBible~~ (only has 2048 examples)
- [ ] Flores
- [ ] CLS
- [ ] ThuNews
Convert to hierarchical
- [x] SNL
- [x] S2S
- [x] P2P
- [ ] VG #656
- [ ] #696
- [ ] #702
Migration guide
Since this might be the first issue for some people, here's a little migration guide for the uninitiated:
- We have a task type called `AbsTaskClustering`, which is quite slow
- We have a lot of old tasks that have enormous test sets, probably excessively so
- We have a couple of tasks that are currently formulated as flat clustering tasks, despite the fact that they are hierarchical in nature.
To fix these issues we have created `AbsTaskClusteringFast`, which is faster, can subsample the datasets (now in a stratified manner) and can deal with hierarchical clustering problems.
To migrate a dataset over to Fast clustering you will need to do a number of things.
0. Turn a dataset in the list into an issue and assign yourself
That way we'll know you're working on the issue :)
1. Make sure the dataset is in the right format
In older clustering tasks every row in a dataset is one clustering task with N passages and N labels.
In `AbsTaskClusteringFast` there is no need for the dataset to contain multiple experiments, as these are created on the spot by subsampling the data.
The type of one entry in `AbsTaskClustering` looks like this:

    {"sentences": ["sentence1", "sentence2"], "labels": ["label1", "label2",...]}
While in `AbsTaskClusteringFast` a dataset entry/row only contains one example, but could contain multiple labels if the task is hierarchical:

    {"sentences": "Sentence1", "labels": ["label_level1", "label_level2",...]}
    {"sentences": "Sentence2", "labels": ["label_level1", "label_level2",...]}
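To illustrate the hierarchical row format above, here is a minimal sketch of how per-level labels might be derived. It assumes, purely for illustration, that the raw data carries a dotted category string such as `cs.CL`; the `hierarchical_labels` helper and the `text` and `category` field names are hypothetical and not part of mteb.

```python
def hierarchical_labels(category, sep="."):
    """Expand a dotted category into one label per level, coarse to fine."""
    parts = category.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


# Hypothetical flat records; the "text" and "category" fields are assumptions.
records = [
    {"text": "A paper about parsing", "category": "cs.CL"},
    {"text": "A paper about algebraic curves", "category": "math.AG"},
]

# One row per example, with one label per hierarchy level.
rows = [
    {"sentences": r["text"], "labels": hierarchical_labels(r["category"])}
    for r in records
]
for row in rows:
    print(row)
```

Datasets whose levels are stored in separate columns would not need the splitting step; the point is only that each row ends up with a single text and a list of labels ordered by level.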
You either need to reupload the dataset after adjustments or use `dataset_transform`.
Here's some example code of how I did it in a PR:

    import itertools

    from datasets import Dataset, DatasetDict


    class SomeFastClusteringTask(AbsTaskClusteringFast):
        ...

        def dataset_transform(self):
            ds = dict()
            for split in self.metadata.eval_splits:
                # Flatten the per-experiment lists into one flat list per column.
                labels = list(itertools.chain.from_iterable(self.dataset[split]["labels"]))
                sentences = list(itertools.chain.from_iterable(self.dataset[split]["sentences"]))
                ds[split] = Dataset.from_dict({"labels": labels, "sentences": sentences})
            # Replace the old nested dataset with the flattened one.
            self.dataset = DatasetDict(ds)
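To see what the flattening does without loading a real dataset, here is a small standalone sketch on toy data. The `flatten_rows` helper and the sample rows are made up for illustration; only the `itertools.chain.from_iterable` call mirrors the transform above.

```python
import itertools

# Toy data in the old layout: each row is one clustering experiment.
old_rows = {
    "sentences": [["a cat", "a dog"], ["a car", "a bus", "a train"]],
    "labels": [["animal", "animal"], ["vehicle", "vehicle", "vehicle"]],
}


def flatten_rows(rows):
    """Chain the per-experiment lists into one flat list per column."""
    sentences = list(itertools.chain.from_iterable(rows["sentences"]))
    labels = list(itertools.chain.from_iterable(rows["labels"]))
    # Every sentence must still line up with exactly one label.
    assert len(sentences) == len(labels)
    return {"sentences": sentences, "labels": labels}


flat = flatten_rows(old_rows)
print(flat["sentences"])
print(flat["labels"])
```

Wrapping the chained iterators in `list(...)` matters because `chain.from_iterable` returns a lazy iterator, which can only be consumed once and has no length.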
2. Subsample if needed
If the dataset is too large, subsample the test set. You are free to select random examples, or use stratified subsampling (should be preferred when possible). The test set should preferably contain no more than 2048 examples.

    def dataset_transform(self):
        self.dataset = self.stratified_subsampling(
            self.dataset,
            self.seed,
            self.metadata.eval_splits,
            label="labels",
            n_samples=2048,
        )
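For intuition about what stratified subsampling means, here is a rough standalone sketch in plain Python. It is not the mteb implementation of `stratified_subsampling`; the function name, signature, and rounding strategy are assumptions, and it only handles flat, hashable labels such as strings.

```python
import random
from collections import defaultdict


def stratified_subsample(sentences, labels, n_samples, seed=42):
    """Sample roughly n_samples items while keeping label proportions."""
    if len(sentences) <= n_samples:
        return list(sentences), list(labels)
    rng = random.Random(seed)
    # Group example indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    total = len(labels)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Quota proportional to the label's share of the full dataset.
        quota = max(1, round(n_samples * len(indices) / total))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    # Rounding can overshoot slightly, so shuffle and trim to n_samples.
    rng.shuffle(chosen)
    chosen = chosen[:n_samples]
    return [sentences[i] for i in chosen], [labels[i] for i in chosen]


sentences = [f"doc {i}" for i in range(1000)]
labels = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
sub_sentences, sub_labels = stratified_subsample(sentences, labels, n_samples=100)
print(len(sub_sentences), sub_labels.count("a"), sub_labels.count("b"), sub_labels.count("c"))
```

Compared with purely random selection, sampling per label keeps small clusters from vanishing by chance, which is why the stratified variant is preferred when possible.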
3. Fill out the metadata properly
Some old tasks do not contain enough metadata. Every new submission should have complete metadata, so please do some research on the dataset before submitting a PR.
4. S2S and P2P tasks
Some clustering tasks have shorter passages (e.g. `"title"`) and longer ones too (e.g. `"abstract"`).
In these cases you should add an S2S (sentence-to-sentence) task over the shorter passages, and a P2P (paragraph-to-paragraph) task over the longer passages.
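As a rough sketch of the S2S/P2P split, the snippet below builds two sets of rows from the same records, one over the short field and one over the long field. The `build_rows` helper, the sample records, and the `category` label field are hypothetical; in a real task each variant would be its own task class with its own metadata.

```python
# Hypothetical source records with a short and a long text field.
records = [
    {"title": "Fast clustering", "abstract": "We study how to speed up clustering evaluation.", "category": "cs.LG"},
    {"title": "Curves over fields", "abstract": "We classify a family of algebraic curves.", "category": "math.AG"},
]


def build_rows(records, text_field, label_field="category"):
    """Build flat sentences/labels columns from the chosen text field."""
    return {
        "sentences": [r[text_field] for r in records],
        "labels": [r[label_field] for r in records],
    }


s2s_rows = build_rows(records, "title")     # short passages -> S2S task
p2p_rows = build_rows(records, "abstract")  # long passages -> P2P task
print(s2s_rows["sentences"])
print(p2p_rows["sentences"])
```

Both variants share the same labels, so results on the two tasks differ only in the length of the text being embedded.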
5. Supersede the old tasks
Since you now have new tasks in the benchmark, we don't need the old ones anymore.
However, do not delete the old tasks. Instead, add a `superseeded_by` attribute on the old tasks.

    class SomeClusteringTask(AbsTaskClustering):
        superseeded_by = "SomeClusteringTaskFast"
6. Run the new tasks
As you will see in the PR template, you will have to run the new tasks on E5 and paraphrase models.
Most importantly: If you have questions or need code review, make sure to mention me or text me about it. I will try to be accessible. If I'm unavailable for some reason, ping @KennethEnevoldsen.
Love this! I ran the suggested code and have one suggestion:
- labels = itertools.chain.from_iterable(self.dataset[split]["labels"])
+ labels = list(itertools.chain.from_iterable(self.dataset[split]["labels"]))
- sentences = itertools.chain.from_iterable(self.dataset[split]["sentences"])
+ sentences = list(itertools.chain.from_iterable(self.dataset[split]["sentences"]))
@isaac-chung oh yeah right, I forgot that sorries :D
That's a wrap!