
add: SICK-BR

Open mrshu opened this issue 9 months ago • 6 comments

Checklist for adding MMTEB dataset

Reason for dataset addition:

Portuguese version of the SICK dataset as an STS task. I believe Portuguese is not currently covered by any STS task.

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
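
As a rough illustration of the subsampling item in the checklist, here is a minimal pure-Python sketch of cutting a test split down to about 2048 examples while keeping the score distribution roughly intact. It is not mteb's `stratified_subsampling()` implementation: the function name `subsample_stratified`, the `score` field, and the choice to bin by rounded score are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_stratified(examples, n_samples=2048, seed=42,
                         key=lambda ex: round(ex["score"])):
    """Draw roughly n_samples examples, keeping each stratum's share similar.

    Hypothetical helper for illustration only; `key` maps an example to
    its stratum (here: the similarity score rounded to an integer).
    """
    if len(examples) <= n_samples:
        # Nothing to do: the split is already small enough.
        return list(examples)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the examples by stratum.
    strata = defaultdict(list)
    for ex in examples:
        strata[key(ex)].append(ex)

    # Sample the same fraction from every stratum (at least one example each),
    # so the total is close to n_samples but not guaranteed to match exactly.
    fraction = n_samples / len(examples)
    sampled = []
    for _, group in sorted(strata.items()):
        k = max(1, round(len(group) * fraction))
        sampled.extend(rng.sample(group, k))
    return sampled
```

Because each stratum is rounded separately, the result can differ from `n_samples` by a few examples; the fixed seed keeps repeated runs identical.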

mrshu avatar May 14 '24 09:05 mrshu

One thing I am unsure of is how to deal with the fact that the dataset is a bit larger (4096 examples in the test subset). Should it be subsampled, even though the other SICK datasets in mteb are not?

mrshu avatar May 14 '24 09:05 mrshu

Sure thing, as long as the dataset is desired, I'd be very happy to fill it all out!

mrshu avatar May 15 '24 05:05 mrshu

@KranthiGV I have already alerted the author about the metadata issues. I believe it's a bit confusing and unnecessary to have two reviewers on the same PR. Can you please review PRs that do not already have a reviewer assigned to them (and of course assign yourself so others can see you're handling it)?

x-tabdeveloping avatar May 15 '24 15:05 x-tabdeveloping

Sure, thanks! Sorry for the confusion.

KranthiGV avatar May 15 '24 15:05 KranthiGV

@KennethEnevoldsen @x-tabdeveloping @KranthiGV I would appreciate it if you could take another look. Thanks!

mrshu avatar May 16 '24 09:05 mrshu

Thanks for the feedback @x-tabdeveloping. I tried to incorporate it and would appreciate it if you could take another look.

mrshu avatar May 16 '24 17:05 mrshu

Thanks for the approval @x-tabdeveloping! Unfortunately, I haven't added any points here yet, and I am not sure what the best way of doing so would be. Should I open a new PR?

mrshu avatar May 17 '24 09:05 mrshu

@mrshu open a new PR with the points (using the number of this PR).

KennethEnevoldsen avatar May 17 '24 09:05 KennethEnevoldsen

Oh, sorry about that. I will approve the points as soon as you submit the PR; just tag me.

x-tabdeveloping avatar May 17 '24 10:05 x-tabdeveloping

Thanks a bunch everyone, it should be ready in https://github.com/embeddings-benchmark/mteb/pull/754

mrshu avatar May 17 '24 10:05 mrshu