Converted VG to hierarchical

Open x-tabdeveloping opened this issue 9 months ago • 2 comments

Checklist for adding MMTEB dataset

Reason for dataset addition: VG Clustering was neither hierarchical nor a ClusteringFast task before #656

  • [x] I have tested that the dataset runs with the mteb package.
  • [ ] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [ ] intfloat/multilingual-e5-small
  • [ ] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

x-tabdeveloping avatar May 14 '24 08:05 x-tabdeveloping

I can't run it for E5 as Ucloud is down today and it would take hours on my computer :')

x-tabdeveloping avatar May 14 '24 08:05 x-tabdeveloping

Also added stratified subsampling code to AbsTask for multilabel problems, as this was missing.
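
For readers unfamiliar with the idea, here is a rough sketch of one way stratified subsampling could be extended to multilabel data. It is not the code added to AbsTask in this PR; the function name, the "labels" field, and the choice to treat each distinct label set as its own stratum are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample_multilabel(rows, n, seed=42):
    """Sample roughly n rows, keeping label-set proportions (hypothetical helper).

    Assumption: each row is a dict with a "labels" list, and every distinct
    label set is treated as one stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(sorted(row["labels"]))].append(row)

    picked = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share of this stratum, with at least one row each.
        k = max(1, round(n * len(members) / len(rows)))
        picked.extend(rng.sample(members, min(k, len(members))))
    return picked


# Toy multilabel data, invented for this example.
rows = (
    [{"labels": ["a"]} for _ in range(600)]
    + [{"labels": ["a", "b"]} for _ in range(300)]
    + [{"labels": ["b", "c"]} for _ in range(100)]
)
sample = stratified_subsample_multilabel(rows, n=100)
counts = Counter(tuple(r["labels"]) for r in sample)
```

Note that with many rare label sets the "at least one row each" rule can push the sample well past n, which is one reason this kind of stratification can behave unexpectedly on multilabel data.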

x-tabdeveloping avatar May 14 '24 08:05 x-tabdeveloping

@x-tabdeveloping let us get this one merged in as well

KennethEnevoldsen avatar May 21 '24 09:05 KennethEnevoldsen

[15:44] There are currently no machines available to run your job.
[15:44] A smaller machine might give you quicker access to your job.
[15:45] Job has been cancelled

:')

x-tabdeveloping avatar May 21 '24 13:05 x-tabdeveloping

Okay, I got a machine, hallelujah

x-tabdeveloping avatar May 21 '24 13:05 x-tabdeveloping

Since the stratified subsampling doesn't exactly work as expected with multilabel data, I will just go with a random sample of 2048 entries I think.
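
A minimal sketch of what a plain random sample of 2048 entries could look like, using only the standard library. The helper name, the fixed seed, and the toy data are assumptions for illustration, not the code that ended up in the task.

```python
import random


def random_subsample(rows, n=2048, seed=42):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rows = list(rows)
    if len(rows) <= n:
        # Nothing to cut: keep the dataset as it is.
        return rows
    # A seeded generator keeps the subsample reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Toy multilabel entries, invented for this example.
data = [{"text": f"doc {i}", "labels": [i % 7, i % 11]} for i in range(10_000)]
subset = random_subsample(data)
```

Because the sample ignores labels entirely, rare label combinations may be missing from the subset; the trade-off is that the size is always exactly 2048 when the dataset is larger than that.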

x-tabdeveloping avatar May 21 '24 13:05 x-tabdeveloping

@KennethEnevoldsen green light?

x-tabdeveloping avatar May 21 '24 14:05 x-tabdeveloping

Tests are failing because of missing datasets unrelated to my PR. What should I do?

x-tabdeveloping avatar May 23 '24 15:05 x-tabdeveloping

Pulling from main should fix it.

KennethEnevoldsen avatar May 24 '24 08:05 KennethEnevoldsen