
add xsim++ task under retrieval category

[Open] jaygala24 opened this issue 10 months ago · 5 comments

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using `make test`.
  • [x] Run the formatter to format the code using `make lint`.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
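The stratified-subsampling checklist item above can be sketched in plain Python. This is a hypothetical helper for illustration, not mteb's actual `stratified_subsampling()` implementation: each label group is sampled proportionally so the subsample keeps the original label distribution.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, label_key, n_samples, seed=42):
    """Sketch: sample each label group proportionally to its share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    total = len(examples)
    subsample = []
    for label, group in by_label.items():
        # Proportional allocation, at least one example per label.
        k = max(1, round(n_samples * len(group) / total))
        subsample.extend(rng.sample(group, min(k, len(group))))
    return subsample

# Toy data: 100 examples, two equally frequent labels.
data = [{"label": i % 2, "text": f"t{i}"} for i in range(100)]
sub = stratified_subsample(data, "label", 10)
```

With a balanced toy dataset, the subsample stays balanced (5 examples per label here).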

jaygala24 · May 01 '24

@KennethEnevoldsen

Sorry, I have opened a new PR for the previously closed PR https://github.com/embeddings-benchmark/mteb/pull/601, as I accidentally messed up the sync of my forked repo.

I was trying to make the current dataset compatible with fast loading, but I am getting the following error. I looked up the corresponding issue (https://github.com/huggingface/datasets/issues/5612) in the datasets library, but it appears to be unresolved.

```
    raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
ValueError: Arrow type large_list<item: large_string> does not have a datasets dtype equivalent.
```

Please let me know how we should proceed with this PR.

jaygala24 · May 01 '24

Isn't this by definition a cross-lingual task (EN-other languages)? Therefore, shouldn't this extend `CrosslingualTask` instead of `MultilingualTask`? You would then also need to change your language definitions.

Andrian0s · May 01 '24
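The language-definition change mentioned above could look roughly like this. This is a hedged sketch: the subset keys and language codes are illustrative assumptions, not mteb's canonical task metadata for xsim++.

```python
# MultilingualTask-style: each subset is one language evaluated on its own.
multilingual_eval_langs = {
    "deu": ["deu-Latn"],
    "fra": ["fra-Latn"],
}

# CrosslingualTask-style: each subset pairs English with one target language,
# so the key names the pair and the value lists exactly two languages.
crosslingual_eval_langs = {
    "eng-deu": ["eng-Latn", "deu-Latn"],
    "eng-fra": ["eng-Latn", "fra-Latn"],
}

# Sanity check: every cross-lingual subset lists exactly two languages.
assert all(len(langs) == 2 for langs in crosslingual_eval_langs.values())
```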

Added some comments in the previous PR #601, as I am implementing something similar and have been working through the same issues.

Andrian0s · May 01 '24

> Isn't this by definition a cross-lingual task (EN-other languages)? Therefore, shouldn't this extend `CrosslingualTask` instead of `MultilingualTask`? You would then also need to change your language definitions.

I would use the cross-lingual formulation in this case.

Regarding fast loading: since you override the `load_data` method, I don't believe it would actually do anything.

That said, I have examined the comments of @Andrian0s (in the original PR). They question whether the current task formulation is the right one, so I believe it is worth settling that discussion before moving on. @jaygala24, will you address the comments (let us keep the discussion in this thread)? It may be worth introducing a variant task, which seems to be required due to `encode_corpus`/`encode_query`.

KennethEnevoldsen · May 01 '24

PR #560 will also allow Retrieval tasks to handle cross-lingual data.

imenelydiaker · May 02 '24

@jaygala24, it seems this PR has gone stale. I will close it for now, but feel free to re-open it if you want to finish it up.

KennethEnevoldsen · May 21 '24

@KennethEnevoldsen Sorry, I was away for personal reasons, so I couldn't follow up on the discussion here. I'll review the entire discussion and then either re-open this PR or open a new one.

jaygala24 · May 30 '24

Wonderful, @jaygala24, glad to have you back!

KennethEnevoldsen · May 31 '24