Added ArXiv Hierarchical clustering (S2S and P2P)
Checklist for adding MMTEB dataset
Reason for dataset addition: Changed ArXiv clustering tasks (S2S and P2P) to hierarchical (two levels separated by dots). #696
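For context, a minimal sketch of the two-level label format, assuming the labels follow the standard arXiv category convention (e.g. `cs.CL`, where the main archive and the subcategory are separated by a dot):

```python
# Hypothetical example label following the arXiv category convention:
label = "cs.CL"
main_level, sub_level = label.split(".")  # two hierarchy levels separated by a dot
print(main_level)  # -> "cs"  (top-level archive)
print(sub_level)   # -> "CL"  (subcategory)
```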
- [x] I have tested that the dataset runs with the `mteb` package.
- [ ] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (see the Python sketch after this checklist).
  - [ ] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [ ] `intfloat/multilingual-e5-small`
- [ ] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
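As an alternative to the CLI command above, a minimal sketch of running both models via the Python API; the task names are an assumption based on the PR title, so adjust them to the actual registered names:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Task names assumed from the PR title; check the names registered in mteb.
TASKS = ["ArxivHierarchicalClusteringS2S", "ArxivHierarchicalClusteringP2P"]

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=TASKS)
    # Writes per-task JSON results that can be attached to the PR.
    evaluation.run(model, output_folder=f"results/{model_name}")
```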
I can't run it because I don't have compute access today; I'll run it when I can.
Also, we should probably wait for the other PR (#694) to merge, as the stratified subsampling will fail without the new code.
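For reference, a sketch of the subsampling pattern the checklist item refers to; the exact signature of `stratified_subsampling` depends on the new code from #694, so treat this as an assumption:

```python
# Inside the task class; a sketch assuming the helper shape from #694.
def dataset_transform(self):
    # Subsample each evaluation split to a manageable size (e.g. <=2048
    # examples) while preserving the label distribution.
    self.dataset = self.stratified_subsampling(
        self.dataset, seed=self.seed, splits=["test"]
    )
```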
This will change the scores of ArxivClustering, correct? I think we need to be more careful about maintaining backward compatibility, e.g. this point here: https://github.com/embeddings-benchmark/mteb/pull/481#issuecomment-2096636221 Maybe we could also allow evaluating the previous way.
Yeah, we need to be careful about this! Can't we use a previous release of MTEB for evaluating models for the current leaderboard? I think it would be really nice if we wouldn't have to worry about production use while trying to get the MMTEB project and paper done as fast as we can. How should we go about this @Muennighoff ?
Actually, I think it is fine because the original ArxivClustering is still supported right? It is just superseded by this one.
Does this mean that when people run https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py they still get the same datasets as originally in MTEB? And to select the newer versions, people need to set the name to e.g. `ArxivClusteringP2P.v3`, correct? @KennethEnevoldsen
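To make the question concrete, a sketch of what I mean; the `.v3` suffix is the hypothetical naming from above, not a confirmed scheme:

```python
from mteb import MTEB

# Hypothetical: the versioned name selects the updated task, while the
# plain name keeps resolving to the original dataset for backward compat.
evaluation = MTEB(tasks=["ArxivClusteringP2P.v3"])
```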
@x-tabdeveloping we should get this PR finished up
Okay, I will just go for dummy subsampling here, as in VG, and we can add stratified subsampling once it is properly addressed (see the sketch below).
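A sketch of what "dummy" (non-stratified) subsampling could look like, assuming the splits are Hugging Face `datasets` objects; the 2048 cap is taken from the checklist above:

```python
# Inside the task class; plain random subsampling as a stopgap until
# stratified subsampling (#694) is available.
def dataset_transform(self):
    for split in self.dataset:
        self.dataset[split] = (
            self.dataset[split]
            .shuffle(seed=self.seed)
            .select(range(min(2048, len(self.dataset[split]))))
        )
```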