refactor: split `BRIGHT` benchmark into individual subset tasks
Close #3268
This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.
Benchmark additions
- Introduced two new benchmarks, `BRIGHT_SUBSETS` and `BRIGHT_SUBSETS_LONG`, in the `mteb/benchmarks/benchmarks/benchmarks.py` file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks.
- Registered the new benchmarks in the `mteb/benchmarks/benchmarks/__init__.py` file for import and usage.

Descriptive statistics
- Added descriptive statistics JSON files for each new BRIGHT subset retrieval task, including both standard and long formats (e.g., `BrightBiologyRetrieval.json`, `BrightBiologyLongRetrieval.json`), detailing sample counts, text lengths, and relevant document statistics for each domain.

Minor improvement
- Minor formatting fix in the `BEIR_NL` benchmark description for improved readability.
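A minimal sketch of how such per-task statistics can be generated (assuming mteb exposes a `calculate_metadata_metrics()` helper on task objects; the exact helper name may differ between versions):

```python
import mteb

# Hypothetical sketch: compute and save descriptive statistics for one of the
# new subset tasks. calculate_metadata_metrics() is an assumption here and may
# be named differently in your mteb version.
task = mteb.get_task("BrightBiologyRetrieval")
task.calculate_metadata_metrics()
```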
You know that you can also simply subselect from a task using:
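Something along these lines (a sketch; the `hf_subsets` filter in `get_task` is an assumption and may differ between mteb versions):

```python
import mteb

# Hypothetical: select only the biology subset of the combined BRIGHT task.
task = mteb.get_task("BrightRetrieval", hf_subsets=["biology"])
evaluation = mteb.MTEB(tasks=[task])
```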
Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We could add support for configuring prompts per subset, but I'm not sure if that's a good idea.
> Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We could add support for configuring prompts per subset, but I'm not sure if that's a good idea.
Ohh... Yeah that is hard to fix.
I see that the original BRIGHT(long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them
If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine & maybe we can rerun some models. For many models on our BRIGHT leaderboard I think I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually make our implementation closer to that one.
Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that would be good enough to test?
To check the implementation, this will be enough; just don't update the old leaderboard.
After splitting BrightRetrieval into multiple tasks, I ran ReasonIR on them with task-specific prompts using the following code:
```python
import torch

import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)
evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)
```
The results are as follows:
| | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before split | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
| after split | 26.18 | 30.71 | 23.96 | 29.76 | 18.62 | 21.15 | 19.89 | 19.65 | 9.22 | 5.12 | 18.34 | 27.12 | 20.81 |
In the paper:
Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through get_model?
https://github.com/embeddings-benchmark/mteb/blob/d2c704c15a6312264822be11986372cc1f7e6c6b/mteb/models/instruct_wrapper.py#L158-L171
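One quick way to sanity-check this could be to wrap `model.encode` and log which prompt would apply for each call (a rough sketch, not the code used here; it relies only on the `task_name`/`prompt_type` kwargs that mteb passes to encoders):

```python
import functools


def log_prompts(model, prompts_dict):
    """Hypothetical helper: print the prompt that prompts_dict maps to for
    every encode() call, based on the task_name/prompt_type kwargs."""
    original_encode = model.encode

    @functools.wraps(original_encode)
    def encode(sentences, **kwargs):
        task_name = kwargs.get("task_name")
        prompt_type = kwargs.get("prompt_type")
        print(f"{task_name} ({prompt_type}): {prompts_dict.get(task_name)!r}")
        return original_encode(sentences, **kwargs)

    model.encode = encode
    return model
```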
After adding code to print the instruction inside the wrapper, the following output was produced:
```
# Biology
Retrieval
  - BrightBiologyRetrieval, s2p
instruction: <|user|>
Given a Biology post, retrieve relevant passages that help answer the post<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:06<00:00, 15.80it/s]
instruction: <|embed|>
Batches:   0%| | 2/50000 [00:02<18:01:38, 1.30s/it]

# Psychology
Retrieval
  - BrightPsychologyRetrieval, s2p
instruction: <|user|>
Given a Psychology post, retrieve relevant passages that help answer the post<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:07<00:00, 14.12it/s]
instruction: <|embed|>
Batches:   0%| | 0/50000 [00:01<?, ?it/s]

# Aops
Retrieval
  - BrightAopsRetrieval, s2p
instruction: <|user|>
Given a Math problem, retrieve relevant examples that help answer the problem<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [00:06<00:00, 16.13it/s]
instruction: <|embed|>
Batches:   0%| | 17/50000 [00:09<7:16:33, 1.91it/s]
```
Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case.
I've checked the ReasonIR code and found some other places that could help with reproducing the results:
- In some cases, the rewritten query is concatenated with the original query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L82-L87
- Sometimes reasoning traces are added to the query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L124
- Maybe IDs can be filtered (ref https://github.com/embeddings-benchmark/mteb/issues/2696), but in the ReasonIR code they just check that no IDs intersect (a rough sketch of this idea follows below): https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L130-L131
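For illustration, a rough sketch of those two ID-handling options (names like `results`, `qrels`, and `excluded_ids` are illustrative; this is not the actual ReasonIR or mteb code):

```python
def drop_excluded(results: dict, excluded_ids: dict) -> dict:
    """Remove each query's excluded document ids from its retrieved results."""
    filtered = {}
    for query_id, doc_scores in results.items():
        excluded = set(excluded_ids.get(query_id, []))
        filtered[query_id] = {
            doc_id: score for doc_id, score in doc_scores.items() if doc_id not in excluded
        }
    return filtered


def assert_no_overlap(qrels: dict, excluded_ids: dict) -> None:
    """Sanity check: no gold document id should also be listed as excluded."""
    for query_id, gold in qrels.items():
        overlap = set(gold) & set(excluded_ids.get(query_id, []))
        assert not overlap, f"{query_id} has gold ids marked as excluded: {overlap}"
```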
@Muennighoff Can you help with what we can do to reproduce the results?
I think the IDs filtering is probably the main missing piece to fully reproduce results?
I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible in the single ReasonIR model seems to be related to the issue mentioned in point 3.
@Samoed
I think it would be better to close this PR and work on it later together with "Excluded IDs missing from BRIGHT dataset" (#2696). Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?
> I think it would be better to close this PR and work on it later together
Do you mean that you don't want the tasks in this PR and will add another PR for #2696?
> Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?
Yes, you need to add the descriptive statistics to merge. To apply the v2 format, you can select subsets from https://huggingface.co/datasets/mteb/BrightRetrieval, but the retrieval dataset loader requires the dataset to have strictly corpus, qrels, and queries, so maybe we need to reupload them instead.
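Roughly something like this for one domain (a sketch; the config names, split name, and column names are guesses based on the original xlangai/BRIGHT layout and may not match mteb/BrightRetrieval exactly):

```python
from datasets import load_dataset

# Assumed configs/splits/columns, mirroring xlangai/BRIGHT: "documents" and
# "examples" configs, one split per domain, with "id", "content", "query",
# and "gold_ids" columns.
corpus_ds = load_dataset("mteb/BrightRetrieval", "documents", split="biology")
examples = load_dataset("mteb/BrightRetrieval", "examples", split="biology")

corpus = {row["id"]: row["content"] for row in corpus_ds}
queries = {row["id"]: row["query"] for row in examples}
qrels = {row["id"]: {doc_id: 1 for doc_id in row["gold_ids"]} for row in examples}
```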
What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.
I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I don't think is a good solution.
In the current code, I see that even if we split into sub-tasks, as long as we use the dataset from xlangai/BRIGHT, we have to load the entire dataset required for all sub-tasks even when we want to evaluate just one sub-task. This seems inefficient.
Additionally, if we use the dataset from mteb/BrightRetrieval, it would be easier to create the v2 format and load each sub-task separately. However, since it does not contain information about excluded_ids, I think it would be difficult to resolve #2696.
What do you think about that?
I think it's better to reupload them, yes. But first we need to resolve the issues with the IDs.
This commit includes the excluded_ids handling implementation for testing purposes. I'll update with performance measurement results after running the benchmarks.
I think it's better to convert this task into reranking to select only the required IDs.
Are you suggesting that we handle it by including all documents in top_ranked except those corresponding to the excluded_ids? I think that if we map the corpus excluding the excluded_ids to each query when constructing top_ranked, it will cause too much memory waste.
> Are you suggesting that we handle it by including all documents in top_ranked except those corresponding to the excluded_ids?
Yes
> I think that if we map the corpus excluding the excluded_ids to each query when constructing top_ranked, it will cause too much memory waste.
Maybe, but I don't think we should add this parameter to evaluation. WDYT @KennethEnevoldsen?
I will try to run a performance evaluation according to your idea first.
It seems that reproducing the performance to some extent is definitely possible.
top_ranked has been added for all tasks, but since only BrightLeetcodeRetrieval, BrightPonyRetrieval, BrightAopsRetrieval, and BrightTheoremQAQuestionsRetrieval have excluded_ids, it seems that top_ranked only needs to be added for those tasks.
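For illustration, building top_ranked this way might look roughly like the following (a hypothetical helper, not the code in this PR; it also makes the memory concern above concrete, since every query carries an almost-full copy of the corpus id list):

```python
def build_top_ranked(
    corpus_ids: list[str],
    query_ids: list[str],
    excluded_ids: dict[str, list[str]],
) -> dict[str, list[str]]:
    """For each query, keep every corpus id except the ones excluded for it."""
    top_ranked = {}
    for query_id in query_ids:
        excluded = set(excluded_ids.get(query_id, []))
        top_ranked[query_id] = [doc_id for doc_id in corpus_ids if doc_id not in excluded]
    return top_ranked
```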
The performance was measured based on the following code:
```python
import logging

import torch

import mteb

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    max_seq_length=32768,
    prompts_dict=prompts_dict,
)

for task_name in prompts_dict.keys():
    tasks = mteb.get_tasks(tasks=[task_name])
    cache = mteb.cache.ResultCache("evaluation/cache")
    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder="evaluation/tests",
            encode_kwargs={"batch_size": 1},
        )
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        continue
```
The performance for the remaining tasks, excluding those where an OOM (Out of Memory) occurred, is as follows:
In the paper:
Great! So for now the most different task is Pony?
Among the tasks with excluded_ids, pony seems to be the most different. The other tasks seem to have reproduced the performance reported in the paper to some extent.
| task | Paper | PR | Diff |
|---|---|---|---|
| Aops | 14.7 | 15.6 | +0.9 |
| Biology | 26.2 | 26.1 | -0.1 |
| Economics | 23.3 | 24.0 | +0.7 |
| Pony | 10.5 | 9.3 | -1.2 |
| Robotics | 18.0 | 18.6 | +0.6 |
| StackOverflow | 23.9 | 21.1 | -2.8 |
| TheoremQAQuestion | 31.9 | 30.1 | -1.8 |
| TheoremQATheorem | 27.2 | 26.5 | -0.7 |
I think the main difference is because you've evaluated the version of the datasets with shots, but it's hard to tell how the tasks were produced for the paper's table. @Muennighoff Can you help with reproducing the scores?
Scores are looking really close, great work. Are you asking me whether in the paper they were evaluated with shots or without?
> were evaluated with shots or without?
Yes
Yeah I think those specific paper results are zero-shot
I set max_seq_length to 32768 based on the following reference. Is this correct?
https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/retrievers.py#L725-L726