Linguistic Probing Tasks
Checklist for adding MMTEB dataset
Reason for dataset addition:
These are linguistic probing tasks proposed in this paper. We could add them as a new task type; however, they are ultimately classification tasks.
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
  - [x]
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
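
For readers unfamiliar with the subsampling item in the checklist, here is a minimal standard-library sketch of what stratified subsampling does: it caps a dataset at a fixed size while keeping label proportions roughly intact. This is not the `mteb` implementation of `self.stratified_subsampling()`; the function name `stratified_subsample`, its parameters, and the example record layout are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, label_key="label", n_samples=2048, seed=42):
    """Illustrative helper (hypothetical, not part of mteb).

    Returns at most n_samples examples, drawing from each label in
    proportion to its share of the full dataset.
    """
    # Small datasets are returned unchanged.
    if len(examples) <= n_samples:
        return list(examples)

    rng = random.Random(seed)  # local RNG so the result is reproducible

    # Group examples by their label.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    total = len(examples)
    sampled = []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        # Proportional quota, with at least one example per label.
        quota = max(1, round(len(group) * n_samples / total))
        sampled.extend(rng.sample(group, min(quota, len(group))))

    # Rounding can overshoot slightly, so shuffle and trim to the cap.
    rng.shuffle(sampled)
    return sampled[:n_samples]
```

With a 3:1 label split over 4000 examples and `n_samples=2048`, this sketch keeps the same 3:1 ratio in the subsample. Fixing the seed keeps the selected subset stable across runs, which matters when results from different models are compared on the same task.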
I will soon create an issue regarding concerns over this PR.
Seems like this was moved to a discussion that was never finished. Will close it for now, but feel free to re-open it.