mteb
add: SICK-BR
Checklist for adding MMTEB dataset
Reason for dataset addition:
Portuguese version of the SICK dataset as an STS task. I believe Portuguese is not currently covered by STS.
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
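As an illustration of the "neither trivial nor random" check above, here is a minimal sketch of how one might sanity-check STS scores. It assumes the score being compared is a Spearman correlation between predicted similarities and gold labels. The helper names and the `low`/`high` thresholds are hypothetical and are not part of `mteb`.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def is_informative(score, low=0.1, high=0.95):
    """Hypothetical thresholds: flag near-random and near-perfect scores."""
    return low < score < high
```

A task passes this rough check when the scores of both models fall between the two thresholds; the exact cut-offs are a judgment call.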
One thing I am unsure of is how to deal with the fact that the dataset is a bit larger (4096 examples in the test subset). Should it be subsampled, even though other SICK datasets in `mteb` are not?
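To make the question concrete, here is a rough sketch of what subsampling the test split from 4096 to about 2048 examples could look like while keeping the distribution of similarity scores roughly intact. This is not the `mteb` implementation of `stratified_subsampling()`. The function name, the binning of continuous scores, and the default parameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(scores, n_samples=2048, n_bins=5, seed=42):
    """Return sorted indices of a subsample that roughly preserves
    the distribution of `scores` across equal-width bins."""
    if n_samples >= len(scores):
        return list(range(len(scores)))

    lo, hi = min(scores), max(scores)
    # Fall back to a width of 1.0 when all scores are identical.
    width = (hi - lo) / n_bins or 1.0

    # Group example indices by score bin.
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        bin_id = min(int((score - lo) / width), n_bins - 1)
        bins[bin_id].append(idx)

    rng = random.Random(seed)
    fraction = n_samples / len(scores)
    chosen = []
    for bin_id in sorted(bins):
        members = bins[bin_id]
        # Per-bin rounding means the total may differ slightly from n_samples.
        k = round(len(members) * fraction)
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)
```

Because each bin is rounded independently, the result can be a few examples off the requested size. A fixed seed keeps the subsample reproducible across runs.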
Sure thing, as long as the dataset is desired, I'd be very happy to fill it all out!
@KranthiGV The author has already been alerted by me about the metadata issues. I believe it's a bit confusing and unnecessary to have two reviewers on the same PR. Can you please review PRs that do not have a reviewer already assigned to them (and of course assign yourself so others can see you're handling it)?
Sure, thanks! Sorry for the confusion.
@KennethEnevoldsen @x-tabdeveloping @KranthiGV I would appreciate if you could take another look. Thanks!
Thanks for the feedback @x-tabdeveloping, I tried to incorporate your feedback and would appreciate if you could take another look.
Thanks for the approval @x-tabdeveloping! Unfortunately, I haven't really added any points here, so I am not sure what would be the best way of doing so. Should I open a new PR?
@mrshu open a new PR with the points (using the number of this PR).
Oh, sorry about that. I will approve the points as soon as you submit the PR; just tag me.
Thanks a bunch everyone, it should be ready in https://github.com/embeddings-benchmark/mteb/pull/754