mmteb | Arabic | Retrieval Task

Checklist for adding MMTEB dataset

Reason for dataset addition:

- This is a dataset for the `mmteb` initiative.
- The dataset is for Arabic retrieval tasks.
- The dataset is for keyword-based search tasks (the retrieval part of the RAG pipeline).
- Despite the promising capabilities of embeddings for semantic search, we still notice challenges when the query is very short and keyword-style.

- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the pr). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
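As an aside, `self.stratified_subsampling()` mentioned in the checklist is provided by `mteb` itself. Purely as an illustration of the underlying idea (proportional per-label sampling), here is a minimal sketch; the function name, signature, and behavior here are hypothetical and are not mteb's actual implementation:

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n, seed=42):
    """Illustrative sketch: downsample `examples` to roughly `n` items
    while preserving the label distribution. Not mteb's implementation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    total = len(examples)
    sampled = []
    for lab, items in by_label.items():
        # Keep each label's proportional share of the target size
        # (at least one item so no stratum vanishes entirely).
        k = max(1, round(n * len(items) / total))
        sampled.extend(rng.sample(items, min(k, len(items))))
    return sampled

examples = [f"doc{i}" for i in range(100)]
labels = [0] * 80 + [1] * 20
subset = stratified_subsample(examples, labels, n=10)  # 8 label-0, 2 label-1
```

The point of stratifying rather than sampling uniformly is that a small subsample can otherwise drift away from the original label balance, which would change the task's difficulty.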
@bakrianoo looks like the tests fail - will you have a look at this?
@KennethEnevoldsen
I have tried to update the metadata values of the dataset many times, but I cannot figure out which field the testing process rejects. Can you help?
```
_________________________ test_all_metadata_is_filled __________________________
[gw0] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

    def test_all_metadata_is_filled():
        all_tasks = get_tasks()
        unfilled_metadata = []
        for task in all_tasks:
            if task.metadata.name not in _HISTORIC_DATASETS:
                if not task.metadata.is_filled():
                    unfilled_metadata.append(task.metadata.name)
        if unfilled_metadata:
>           raise ValueError(
                f"The metadata of the following datasets is not filled: {unfilled_metadata}"
            )
E           ValueError: The metadata of the following datasets is not filled: ['SadeemKeywordRetrieval']

tests/test_TaskMetadata.py:436: ValueError
=============================== warnings summary ===============================
tests/test_mteb.py::test_mteb_task[average_word_embeddings_levy_dependency-task0]
  /home/runner/.cache/huggingface/modules/datasets_modules/datasets/strombergnlp--bornholmsk_parallel/a93ddacca6042553271bf4c1c0e035df3fccf848c6820417766752d676567815/bornholmsk_parallel.py:27: DeprecationWarning: invalid escape sequence \o
    _CITATION = """\

tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_reranker_same_ndcg1
  /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(
```
https://github.com/embeddings-benchmark/mteb/actions/runs/9128236014/job/25100164821?pr=669
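For context on what the failing test checks: it walks every task and calls `task.metadata.is_filled()`. Conceptually (this is a much-reduced hypothetical stand-in, not mteb's actual `TaskMetadata` class), the check amounts to verifying that no metadata field was left as `None`:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class TaskMetadataSketch:
    """Hypothetical, simplified stand-in for mteb's TaskMetadata."""
    name: str
    description: Optional[str] = None
    license: Optional[str] = None

    def is_filled(self) -> bool:
        # "Filled" means every field holds a concrete (non-None) value.
        return all(getattr(self, f.name) is not None for f in fields(self))

complete = TaskMetadataSketch(
    "SadeemKeywordRetrieval", "Arabic keyword retrieval", "cc-by-4.0"
)
partial = TaskMetadataSketch("SadeemKeywordRetrieval")  # fields left as None
```

A field still set to `None` (as on `partial` here) is what makes the real test report the task name, so the fix is to fill in every field of the metadata object rather than to change the test.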
Hi @bakrianoo, I faced a similar error. These are the steps I took to fix it:

1. Go to the file `tests/test_TaskMetadata.py`.
2. Add `'SadeemKeywordRetrieval',` to the `_HISTORIC_DATASETS` list manually.
3. Save the file and run `make test`.
Please do not do this. We specifically have exceptions for `_HISTORIC_DATASETS`, but the test is intended to fail for new datasets. @Ruqyai, if you have done this for a previous dataset, please make a PR with the fix.
Thanks @KennethEnevoldsen. I have opened PR #763 for this.
Please check whether you can merge my PR without needing to comment out the `test_all_metadata_is_filled` function.
@bakrianoo would love to have this PR merged in. I will close it for now, but if you have the time please do re-open it and address the metadata issues. I will make sure it gets a quick review and that we finish up the metadata.