GPT4-o generated queries for 14 languages
Checklist for adding MMTEB dataset
Reason for dataset addition: Succinct queries generated by a strong multilingual LLM, grounded in Wikipedia articles nicely chunked by Cohere, should be a strict improvement over the many machine-translated versions of SQuAD in different languages. Wikipedia is probably the highest-quality (available) corpus in most languages. See #378.
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
WIP, and I am running query generation overnight for the remaining 12 languages on this list:
```python
LANG_MAP = {
    "de": "German",
    "bn": "Bengali",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "uk": "Ukrainian",
    "nl": "Dutch",
    "cs": "Czech",
    "ro": "Romanian",
    "bg": "Bulgarian",
    "sr": "Serbian",
    "fi": "Finnish",
    "fa": "Persian",
    "hi": "Hindi",
}
```
Draft PR for early feedback. @KennethEnevoldsen @Muennighoff happy to hear any suggestions :)
Generated with this prompt and `temperature=0.0`, `max_tokens=512`:
```
Your task is to anticipate possible search queries by users in the form of a question for a given document.
- The question must be written in {{ language }}
- The question should be formulated concretely and precisely and relate to the information from the given document
- The question must be coherent and should make sense without knowing the document
- The question must be answerable by the document
- The question should focus on one aspect and avoid using subclauses connected with 'and'
- The question should not be overly specific and should mimic a request of a user who is just starting to research the given topic
- Do not draw on your prior knowledge

Generate a question in {{ language }} for the following document:
<document>
{{ document }}
</document>
Search query:
```
During generation, `"{title}\n\n"` was prepended to the chunk.
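For reference, the generation step was roughly the following (a simplified sketch, not the exact script; the function name `generate_query` and the template handling are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def generate_query(prompt_template: str, language: str, title: str, chunk: str) -> str:
    # Prepend the article title to the chunk, as described above.
    document = f"{title}\n\n{chunk}"
    prompt = prompt_template.replace("{{ language }}", language).replace(
        "{{ document }}", document
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=512,
    )
    return response.choices[0].message.content.strip()
```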
Query quality was inspected manually by native speakers in German and Bengali.
I calculated recent log views following the approach of https://huggingface.co/datasets/Cohere/wikipedia-22-12 and applied them to https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
Per language, I filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k most-viewed articles.
I selected a random window of 9 consecutive paragraphs per article, chose the middle one as the positive context, and generated a query for it with `gpt-4o`.
The surrounding 8 paragraphs act as hard negatives and have a score of 0.5 in the qrels dataset.
The 9 paragraphs per article are used for the reranking task, with one positive and 8 negatives. The retrieval task uses the one positive, the 8 hard negatives, and the remaining corpus as negatives.
The choice of hard negatives is debatable. I could prepend `"{title}\n\n"` to the chunks or add more random (true) negatives to the reranking negatives. As it is now, the German reranking task looks too easy, but the Bengali one is fine.
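For clarity, the per-article construction described above looks roughly like this (a sketch with illustrative names, not the exact script):

```python
import random

def build_article_example(paragraphs: list[str], article_id: str) -> tuple[str, dict[str, float]]:
    """Sample a window of 9 consecutive paragraphs; the middle one becomes the
    positive context, the surrounding 8 become hard negatives scored 0.5."""
    # Articles with fewer than 9 paragraphs were filtered out beforehand.
    start = random.randint(0, len(paragraphs) - 9)
    window = paragraphs[start : start + 9]
    positive = window[4]  # middle paragraph of the window
    qrels = {
        f"{article_id}_{start + i}": (1.0 if i == 4 else 0.5)
        for i in range(9)
    }
    return positive, qrels
```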
How do I run reranking on a multilingual dataset? I now have the different languages as subsets in https://huggingface.co/datasets/ellamind/wikipedia-2023-11-reranking-multilingual. But I don't see a way to specify `config=` in a task, and I don't think I can add multiple languages as splits.
Except for the one `WikipediaRerankingMultilingual` task, I can tick almost all boxes:
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
For 11 languages, we have the first retrieval task. So with 11*4 and the other points we already hit the cap of 50 points. Am I correct?
I'm trying to understand this paragraph from the points documentation.
> The first dataset for a language x task gains 4 bonus points. If the number of new languages is >= 12 then points for that PR for a new dataset are capped at 50 (12 * 4 + 2 = 48 + 2 = 50).
Not all of my languages are new, so strictly speaking the cap does not apply? These are the added languages:
`languages = ["de", "bn", "it", "pt", "nl", "cs", "ro", "bg", "sr", "fi", "fa", "hi", "da", "en"]`
For these languages the retrieval task is the first of its kind: `["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]`. So these would give 11 * 4 = 44 points.
`["de", "bn", "en"]` are new datasets, but already have a retrieval task, so 3 * 2 = 6 points.
For the `WikipediaRerankingMultilingual` task, I pulled all the languages together into a single dataset, and there already exists a multilingual reranking task, so 2 points.
This would result in 52 points for this PR.
Or are the 4 bonus points meant to be added on top of the 2 points per dataset? This would result in 11 * (4+2) + 6 + 2 = 74 points.
EDIT: `points.md` suggests that the bonus points are added on top per dataset:
```
{
    "GitHub": "GitHubUser1",
    "New dataset": 2-6, # 2 points for the dataset and 4 points for the task
    "New task": 2, # e.g. a new style of task (e.g. classification, or retrieval)
    "Dataset annotations": 1, # 1 point for each full dataset annotation
    "Bug fixes": 2-10, # depends on the complexity of the fix
    "Running Models": 1, # pr model run
    "Review PR": 2, # two points pr. reviewer, can be given to multiple reviewers
    "Paper Writing": NA,
    "Ideation": NA,
    "Coordination": NA
}
```
EDIT2: If so, then my points for #197 need to be updated from 4 -> 6. Can we arrange for my coworkers and me to appear next to each other as coauthors? I can make slight adjustments to the current PR points, if needed.
Related to points:
I would calculate it as follows:
For the Retrieval dataset (which I do think should be combined): `["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]` is 11 * 4 = 44, and then add 2 for the dataset. So that is 46.
For the reranking you get the scores per language as well:
```python
EVAL_LANGS = {
    "bg": ["bul-Cyrl"],
    "bn": ["ben-Beng"],
    "cs": ["ces-Latn"],
    "da": ["dan-Latn"],
    "de": ["deu-Latn"],  # has reranking
    "en": ["eng-Latn"],  # has reranking
    "fa": ["fas-Arab"],
    "fi": ["fin-Latn"],
    "hi": ["hin-Deva"],
    "it": ["ita-Latn"],
    "nl": ["nld-Latn"],
    "pt": ["por-Latn"],
    "ro": ["ron-Latn"],
    "sr": ["srp-Cyrl"],
}
```
So that would be 12 * 4 + 2 = 50.
So in total, you get 50+48=98.
You can still max out the bonus for both datasets by adding 3 more languages (up to you if you feel like it is worth it). If you want to do that, I can review Swedish and Norwegian as well.
Great, thanks for reviewing! :)
Points are more than enough, but since I already have `wikipedia-no` ready, I can add that as well.
Could you give me a hint on how best to upload multilingual datasets? Right now I have the languages as dataset configs, which show up as the subset drop-down menu on the HF hub. Passing the languages as `eval_langs=` to a task did not work for me.
I dug deeper into the code base and the only thing I came up with was to add a `config=` kwarg at the point where the dataset actually gets loaded. But since this is in core MTEB, I thought there must be another way at the task level.
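What I had in mind is roughly the pattern below: loop over the languages declared for the task and load each HF config separately inside `load_data`. This is only a sketch under my reading of the code base; the base class and exact attribute names are assumptions, not final code.

```python
import datasets
from mteb.abstasks import AbsTaskReranking

class WikipediaRerankingMultilingual(AbsTaskReranking):
    # metadata = TaskMetadata(..., eval_langs={"de": ["deu-Latn"], "bn": ["ben-Beng"], ...})

    def load_data(self, **kwargs):
        if self.data_loaded:
            return
        self.dataset = {}
        for lang in self.metadata_dict["eval_langs"]:
            # Each language is a separate HF config ("subset"), passed via `name=`.
            self.dataset[lang] = datasets.load_dataset(
                self.metadata_dict["dataset"]["path"],
                name=lang,
                revision=self.metadata_dict["dataset"]["revision"],
            )
        self.dataset_transform()
        self.data_loaded = True
```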
Are we sure we want machine generated datasets? If we don't take machine-translated ones why should we take machine generated ones?
> Are we sure we want machine generated datasets? If we don't take machine-translated ones why should we take machine generated ones?
@x-tabdeveloping we had the discussion in an issue beforehand. I believe the quality is good enough to warrant inclusion (def. better than e.g. retrieval based on article headlines, I would argue). That being said, it might introduce odd biases. We can def. examine if that is the case once we start running models.
Machine-translated ones often translate whole passages, and not all translation services are good. In the current dataset, the passages are human-written and sampled from the top articles by page views. Only the short queries are generated, using the strongest currently available multilingual LLM (gpt-4o).
Generating with temperature 0 and the current prompt basically just 'rephrases' the provided human-written document into a single, succinct question.
@rasdani and @x-tabdeveloping, this does raise an interesting point for the discussion section of the paper: can datasets such as these approximate the performance of high-quality datasets? E.g. a comparison between MIRACL and these seems reasonable.
btw @rasdani, it seems like the tests fail, will you have a look at it? (It seems to be due to the mock test overwriting the `datasets` concatenate method.)
@rasdani I would love to have this PR merged in. Will you have a look at the tests? Then I believe it is ready to merge.
yes, I will! tonight or tomorrow night
I added "no" and "sv": https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-no-queries https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-sv-queries
I managed to fix the `MultilingualReranking` task and added results. However, I'm stuck with some missing import for the `MultilingualRetrieval` task ("Task not found" in the terminal), and I'm hitting HuggingFace upload limits for the multilingual retrieval dataset.
Will try to finish up tomorrow. If you can spot whether I'm missing an import somewhere, please let me know.
When the `MultilingualRetrieval` task works, I will delete the language-specific retrieval tasks.
I modified the points for this current PR such that we all end up with the same number of total points, if I account for the (corrected) 6 points of my old PR. This way we should end up next to each other on the paper.
{"GitHub": "rasdani", "New dataset": 20}
{"GitHub": "ShawonAshraf", "New dataset": 26}
{"GitHub": "bjoernpl", "New dataset": 26}
{"GitHub": "jphme", "New dataset": 26}
{"GitHub": "KennethEnevoldsen", "Review PR": 2}
@rasdani, hope you are well. I was hoping that I could ask you to add a section to the paper on the wiki retrieval (appendix B2). I would also like to add the correlation plots from the issue.
@rasdani will just shoot you a second ping here in case you missed the one above.
@KennethEnevoldsen I indeed missed the first ping.
I can add the correlation plots and write up a rough draft of what I did. Beyond that I can't put a lot of work into it unfortunately, since I'm constrained by work and other demands.
What's the process here, opening a new PR?
You can find a link to the paper here: #595
I have added two headers in appendix B4 where you can add the draft + correlation plots. After that I don't believe there will be any additional work (maybe only related to author information during submission).