mmteb | Arabic | Retrieval Task

Open bakrianoo opened this issue 1 year ago • 1 comments

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • This is a dataset for mmteb initiative.

  • The Dataset is for Arabic Retrieval tasks

  • The Dataset is for Keyword-Based searching tasks (The retrieval part in the RAG pipeline)

  • Despite the promising capabilities of embeddings for semantic search, we still notice challenges when the query becomes very short and is written in keyword style.

  • [x] I have tested that the dataset runs with the mteb package.

  • [x] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.

    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).

  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()

  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).

  • [x] Run tests locally to make sure nothing is broken using make test.

  • [x] Run the formatter to format the code using make lint.

  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

bakrianoo avatar May 11 '24 10:05 bakrianoo

@bakrianoo it looks like the tests fail. Will you have a look at this?

KennethEnevoldsen avatar May 15 '24 11:05 KennethEnevoldsen

@KennethEnevoldsen

I have tried updating the dataset's metadata values many times, but I cannot work out which metadata field the test rejects. Can you help?

_________________________ test_all_metadata_is_filled __________________________
[gw0] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

    def test_all_metadata_is_filled():
        all_tasks = get_tasks()
    
        unfilled_metadata = []
        for task in all_tasks:
            if task.metadata.name not in _HISTORIC_DATASETS:
                if not task.metadata.is_filled():
                    unfilled_metadata.append(task.metadata.name)
        if unfilled_metadata:
>           raise ValueError(
                f"The metadata of the following datasets is not filled: {unfilled_metadata}"
            )
E           ValueError: The metadata of the following datasets is not filled: ['SadeemKeywordRetrieval']

tests/test_TaskMetadata.py:436: ValueError
=============================== warnings summary ===============================
tests/test_mteb.py::test_mteb_task[average_word_embeddings_levy_dependency-task0]
  /home/runner/.cache/huggingface/modules/datasets_modules/datasets/strombergnlp--bornholmsk_parallel/a93ddacca6042553271bf4c1c0e035df3fccf848c6820417766752d676567815/bornholmsk_parallel.py:27: DeprecationWarning: invalid escape sequence \o
    _CITATION = """\

tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_reranker_same_ndcg1
  /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(

https://github.com/embeddings-benchmark/mteb/actions/runs/9128236014/job/25100164821?pr=669

bakrianoo avatar May 17 '24 12:05 bakrianoo
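
The error message only names the task, not the field that is missing. A minimal sketch of how the empty fields could be listed, assuming that "filled" means "no field is None" (an assumption about is_filled(), not something confirmed in this thread). FakeTaskMetadata, its field names, and unfilled_fields are hypothetical stand-ins written with the standard library, not the mteb classes:

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FakeTaskMetadata:
    """Hypothetical stand-in for a task's metadata object (not the mteb class)."""

    name: str
    description: Optional[str] = None
    license: Optional[str] = None
    date: Optional[tuple] = None


def unfilled_fields(metadata) -> List[str]:
    """Return the names of all fields whose value is still None."""
    return [f.name for f in fields(metadata) if getattr(metadata, f.name) is None]


# Illustrative values only; the real metadata of the dataset is not shown in this thread.
meta = FakeTaskMetadata(
    name="SadeemKeywordRetrieval",
    description="placeholder description",
)

print(unfilled_fields(meta))  # ['license', 'date']
```

Running the same kind of loop over the real metadata object of the task would point at the exact fields to fill in. How the field names are enumerated there depends on how that class is defined, which this sketch does not assume.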

Hi @bakrianoo, I faced a similar error. These are the steps I took to fix it:

  • go to the file tests/test_TaskMetadata.py

  • add 'SadeemKeywordRetrieval', to the list of _HISTORIC_DATASETS manually.

  • save the file and run make test

Ruqyai avatar May 17 '24 14:05 Ruqyai

Please do not do this. We specifically have exceptions for _HISTORIC_DATASETS, but the test is intended to fail for new datasets. @Ruqyai, if you have done this for a previous dataset, please make a PR with the fix.

KennethEnevoldsen avatar May 17 '24 16:05 KennethEnevoldsen
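
To illustrate the point, here is a minimal sketch that mirrors the structure of the test shown in the log earlier in this thread. It assumes that an unfilled metadata object is one with at least one None value; find_unfilled, the task dictionaries, and "SomeLegacyTask" are hypothetical and not part of mteb:

```python
from typing import Dict, List, Optional, Set


def find_unfilled(
    tasks: Dict[str, Dict[str, Optional[str]]], exempt: Set[str]
) -> List[str]:
    """Names of non-exempt tasks that still have at least one None metadata value."""
    return [
        name
        for name, metadata in tasks.items()
        if name not in exempt and any(value is None for value in metadata.values())
    ]


# Hypothetical data: one legacy task and one new task, both with a missing field.
tasks = {
    "SomeLegacyTask": {"description": "old task", "license": None},
    "SadeemKeywordRetrieval": {"description": "new task", "license": None},
}
historic = {"SomeLegacyTask"}

# Intended behaviour: the new task is reported, the legacy task is exempt.
print(find_unfilled(tasks, historic))  # ['SadeemKeywordRetrieval']

# The workaround: the report is empty, yet the field is still missing.
print(find_unfilled(tasks, historic | {"SadeemKeywordRetrieval"}))  # []
print(tasks["SadeemKeywordRetrieval"]["license"] is None)  # True
```

Adding the new task to the exemption set makes the check pass without changing the metadata, which is why the exemption is meant for legacy datasets only.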

Thanks @KennethEnevoldsen. I am addressing this in PR #763. Please check whether you can merge my PR without needing to comment out the test_all_metadata_is_filled function.

Ruqyai avatar May 18 '24 06:05 Ruqyai

@bakrianoo we would love to have this PR merged in. I will close it for now, but if you have the time, please do re-open it and address the metadata issues. I will make sure it gets a quick review and that we finish up the metadata.

KennethEnevoldsen avatar May 21 '24 09:05 KennethEnevoldsen