GPT4-o generated queries for 14 languages
Checklist for adding MMTEB dataset
Reason for dataset addition: Succinct queries generated by a strong multilingual LLM, grounded in Wikipedia articles nicely chunked by Cohere, should be a strict improvement over the many machine-translated versions of SQuAD in different languages. Wikipedia is probably the highest-quality (available) corpus in most languages. See #378.
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
WIP, and I am running query generation overnight for the remaining 12 languages on this list:
```python
LANG_MAP = {
    "de": "German",
    "bn": "Bengali",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "uk": "Ukrainian",
    "nl": "Dutch",
    "cs": "Czech",
    "ro": "Romanian",
    "bg": "Bulgarian",
    "sr": "Serbian",
    "fi": "Finnish",
    "fa": "Persian",
    "hi": "Hindi",
}
```
Draft PR for early feedback. @KennethEnevoldsen @Muennighoff happy to hear any suggestions :)
Generated with this prompt and `temperature=0.0`, `max_tokens=512`:
```
Your task is to anticipate possible search queries by users in the form of a question for a given document.
- The question must be written in {{ language }}
- The question should be formulated concretely and precisely and relate to the information from the given document
- The question must be coherent and should make sense without knowing the document
- The question must be answerable by the document
- The question should focus on one aspect and avoid using subclauses connected with 'and'
- The question should not be overly specific and should mimic a request of a user who is just starting to research the given topic
- Do not draw on your prior knowledge

Generate a question in {{ language }} for the following document:
<document>
{{ document }}
</document>
Search query:
```
During generation, `"{title}\n\n"` was prepended to the chunk.
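For reference, the generation step was roughly the following (a simplified sketch, not the exact script; the function name `generate_query` and the template handling are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def generate_query(prompt_template: str, language: str, title: str, chunk: str) -> str:
    # Prepend the article title to the chunk, as described above.
    document = f"{title}\n\n{chunk}"
    prompt = prompt_template.replace("{{ language }}", language).replace(
        "{{ document }}", document
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=512,
    )
    return response.choices[0].message.content.strip()
```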
Query quality was inspected manually by native speakers in German and Bengali.
I calculated recent log views following the approach of https://huggingface.co/datasets/Cohere/wikipedia-22-12 and applied them to https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
Per language, I filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k most-viewed articles.
I selected a random window of 9 consecutive paragraphs per article, chose the middle one as the positive context, and generated a query for it with `gpt-4o`.
The surrounding 8 paragraphs act as hard negatives and have a score of 0.5 in the qrels dataset.
The 9 paragraphs per article are used for the reranking task, with one positive and 8 negatives. The retrieval task uses the one positive, the 8 hard negatives, and the remaining corpus as negatives.
The choice of hard negatives is debatable. I could prepend `"{title}\n\n"` to the chunks or add more random (true) negatives to the reranking negatives. As it is now, the German reranking task looks too easy, but the Bengali one is fine.
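For clarity, the per-article construction described above looks roughly like this (a sketch with illustrative names, not the exact script):

```python
import random

def build_article_example(paragraphs: list[str], article_id: str) -> tuple[str, dict[str, float]]:
    """Sample a window of 9 consecutive paragraphs; the middle one becomes the
    positive context, the surrounding 8 become hard negatives scored 0.5."""
    # Articles with fewer than 9 paragraphs were filtered out beforehand.
    start = random.randint(0, len(paragraphs) - 9)
    window = paragraphs[start : start + 9]
    positive = window[4]  # middle paragraph of the window
    qrels = {
        f"{article_id}_{start + i}": (1.0 if i == 4 else 0.5)
        for i in range(9)
    }
    return positive, qrels
```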
How do I run reranking on a multilingual dataset? I now have the different languages as subsets in https://huggingface.co/datasets/ellamind/wikipedia-2023-11-reranking-multilingual. But I don't see a way to specify `config=` in a task, and I don't think I can add multiple languages as splits.
Except for the one `WikipediaRerankingMultilingual` task, I can tick almost all boxes:
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [x] I have filled out the metadata object in the dataset file (find documentation on it here).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. `438.jsonl`).
For 11 languages, we have the first retrieval task. So with 11*4 and the other points we already hit the cap of 50 points. Am I correct?
I'm trying to understand this paragraph from the points documentation.
> The first dataset for a language x task gains 4 bonus points. If the number of new languages is >= 12 then points for that PR for a new dataset are capped at 50 (12 * 4 + 2 = 48 + 2 = 50).
Not all of my languages are new, so strictly speaking the cap does not apply? These are the added languages:
`languages = ["de", "bn", "it", "pt", "nl", "cs", "ro", "bg", "sr", "fi", "fa", "hi", "da", "en"]`
For these languages the retrieval task is the first of its kind: `["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]`. So these would give 11 * 4 = 44 points.
`["de", "bn", "en"]` are new datasets, but already have a retrieval task, so 3 * 2 = 6 points.
For the `WikipediaRerankingMultilingual` task, I pulled all the languages together into a single dataset, and there already exists a multilingual reranking task, so 2 points.
This would result in 52 points for this PR.
Or are the 4 bonus points meant to be added on top of the 2 points per dataset? This would result in 11 * (4+2) + 6 + 2 = 74 points.
EDIT: `points.md` suggests that the bonus points are added on top per dataset:
```
{
    "GitHub": "GitHubUser1",
    "New dataset": 2-6, # 2 points for the dataset and 4 points for the task
    "New task": 2, # e.g. a new style of task (e.g. classification, or retrieval)
    "Dataset annotations": 1, # 1 point for each full dataset annotation
    "Bug fixes": 2-10, # depends on the complexity of the fix
    "Running Models": 1, # pr model run
    "Review PR": 2, # two points pr. reviewer, can be given to multiple reviewers
    "Paper Writing": NA,
    "Ideation": NA,
    "Coordination": NA
}
```
EDIT2: If so, then my points for #197 need to be updated from 4 -> 6. Can we arrange for my coworkers and me to appear next to each other as coauthors? I can make slight adjustments to the current PR points, if needed.
Related to points:
I would calculate it as follows:
For the Retrieval dataset (which I do think should be combined): `["be", "bg", "cs", "nl", "fa", "fi", "hi", "it", "pt", "ro", "sr"]` is 11 * 4 = 44, and then add 2 for the dataset. So that is 46.
For the reranking you get the scores per language as well:
```python
EVAL_LANGS = {
    "bg": ["bul-Cyrl"],
    "bn": ["ben-Beng"],
    "cs": ["ces-Latn"],
    "da": ["dan-Latn"],
    "de": ["deu-Latn"],  # has reranking
    "en": ["eng-Latn"],  # has reranking
    "fa": ["fas-Arab"],
    "fi": ["fin-Latn"],
    "hi": ["hin-Deva"],
    "it": ["ita-Latn"],
    "nl": ["nld-Latn"],
    "pt": ["por-Latn"],
    "ro": ["ron-Latn"],
    "sr": ["srp-Cyrl"],
}
```
So that would be 12 * 4 + 2 = 50.
So in total, you get 50+48=98.
You can still max out the bonus for both datasets by adding 3 more languages (up to you if you feel like it is worth it). If you want to do that, I can review Swedish and Norwegian as well.
Great, thanks for reviewing! :)
Points are more than enough, but since I already have `wikipedia-no` ready, I can add that as well.
Could you give me a hint on how best to upload multilingual datasets? Right now I have the languages as dataset configs, which show up as the subset drop-down menu on the HF hub. Passing the languages as `eval_langs=` to a task did not work for me.
I dug deeper into the code base and the only thing I came up with was to add a `config=` kwarg at the point where the dataset actually gets loaded. But since this is in core MTEB, I thought there must be another way at the task level.
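What I had in mind is roughly the pattern below: loop over the languages declared for the task and load each HF config separately inside `load_data`. This is only a sketch under my reading of the code base; the base class and exact attribute names are assumptions, not final code.

```python
import datasets
from mteb.abstasks import AbsTaskReranking

class WikipediaRerankingMultilingual(AbsTaskReranking):
    # metadata = TaskMetadata(..., eval_langs={"de": ["deu-Latn"], "bn": ["ben-Beng"], ...})

    def load_data(self, **kwargs):
        if self.data_loaded:
            return
        self.dataset = {}
        for lang in self.metadata_dict["eval_langs"]:
            # Each language is a separate HF config ("subset"), passed via `name=`.
            self.dataset[lang] = datasets.load_dataset(
                self.metadata_dict["dataset"]["path"],
                name=lang,
                revision=self.metadata_dict["dataset"]["revision"],
            )
        self.dataset_transform()
        self.data_loaded = True
```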
Are we sure we want machine generated datasets? If we don't take machine-translated ones why should we take machine generated ones?
> Are we sure we want machine generated datasets? If we don't take machine-translated ones why should we take machine generated ones?
@x-tabdeveloping we had the discussion in an issue beforehand. I believe the quality is good enough to warrant inclusion (def. better than e.g. retrieval based on article headlines, I would argue). That being said, it might introduce odd biases. We can def. examine if that is the case once we start running models.
Machine-translated ones often translate whole passages, and not all translation services are good. In the current dataset, the passages are human-written and sampled from the top articles by page views. Only the short queries are generated, using the strongest currently available multilingual LLM (gpt-4o).
Generating with temperature 0 and the current prompt basically just 'rephrases' the provided human-written document into a single, succinct question.
@rasdani and @x-tabdeveloping, this does raise an interesting point for the discussion section of the paper: can datasets such as these approximate the performance of high-quality datasets? E.g. a comparison between MIRACL and these seems reasonable.
btw @rasdani, it seems like the tests fail, will you have a look at it? (It seems to be due to the mock test overwriting the `datasets` concatenate method.)
@rasdani I would love to have this PR merged in. Will you have a look at the tests? Then I believe it is ready to merge.
yes, I will! tonight or tomorrow night
I added "no" and "sv": https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-no-queries https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-sv-queries
I managed to fix the `MultilingualReranking` task and added results. However, I'm stuck with some missing import for the `MultilingualRetrieval` task ("Task not found" in the terminal), and I'm hitting HuggingFace upload limits for the multilingual retrieval dataset.
Will try to finish up tomorrow. If you can spot whether I'm missing an import somewhere, please let me know.
When the `MultilingualRetrieval` task works, I will delete the language-specific retrieval tasks.
I modified the points for this current PR such that we all end up with the same number of total points, if I account for the (corrected) 6 points of my old PR. This way we should end up next to each other on the paper.
{"GitHub": "rasdani", "New dataset": 20}
{"GitHub": "ShawonAshraf", "New dataset": 26}
{"GitHub": "bjoernpl", "New dataset": 26}
{"GitHub": "jphme", "New dataset": 26}
{"GitHub": "KennethEnevoldsen", "Review PR": 2}
@rasdani, hope you are well. I was hoping that I could ask you to add a section to the paper on the wiki retrieval (appendix B2). I would also like to add the correlation plots from the issue.
@rasdani will just shoot you a second ping here in case you missed the one above.
@KennethEnevoldsen I indeed missed the first ping.
I can add the correlation plots and write up a rough draft of what I did. Beyond that I can't put a lot of work into it unfortunately, since I'm constrained by work and other demands.
What's the process here, opening a new PR?
You can find a link to the paper here: #595
I have added two headers in appendix B4 where you can add the draft + correlation plots. After that I don't believe there will be any additional work (maybe only related to author information during submission).