refactor: split `BRIGHT` benchmark into individual subset tasks
Close #3268
This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.
Benchmark additions
- Introduced two new benchmarks, `BRIGHT_SUBSETS` and `BRIGHT_SUBSETS_LONG`, in the `mteb/benchmarks/benchmarks/benchmarks.py` file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks.
- Registered the new benchmarks in the `mteb/benchmarks/benchmarks/__init__.py` file for import and usage.

Descriptive statistics
- Added descriptive statistics JSON files for each new BRIGHT subset retrieval task, including both standard and long formats (e.g., `BrightBiologyRetrieval.json`, `BrightBiologyLongRetrieval.json`), detailing sample counts, text lengths, and relevant document statistics for each domain.

Minor improvement
- Minor formatting fix in the `BEIR_NL` benchmark description for improved readability.
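A minimal sketch of how such per-task statistics can be generated (assuming mteb exposes a `calculate_metadata_metrics()` helper on task objects; the exact helper name may differ between versions):

```python
import mteb

# Hypothetical sketch: compute and save descriptive statistics for one of the
# new subset tasks. calculate_metadata_metrics() is an assumption here and may
# be named differently in your mteb version.
task = mteb.get_task("BrightBiologyRetrieval")
task.calculate_metadata_metrics()
```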
You know that you can also simply subselect from a task using:
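Something along these lines (a sketch; the `hf_subsets` filter in `get_task` is an assumption and may differ between mteb versions):

```python
import mteb

# Hypothetical: select only the biology subset of the combined BRIGHT task.
task = mteb.get_task("BrightRetrieval", hf_subsets=["biology"])
evaluation = mteb.MTEB(tasks=[task])
```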
Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We could add support for configuring prompts per subset, but I'm not sure if that's a good idea.
> Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We could add support for configuring prompts per subset, but I'm not sure if that's a good idea.
Ohh... Yeah that is hard to fix.
I see that the original BRIGHT(long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them
If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine & maybe we can rerun some models. For many models on our BRIGHT leaderboard I think I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually make our implementation closer to that one.
Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that would be good enough to test?
To check the implementation, this will be enough; just don't update the old leaderboard.
After splitting BrightRetrieval into multiple tasks, I ran ReasonIR on them with task-specific prompts using the following code:
```python
import torch

import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)
evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)
```
The results are as follows:
| | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before split | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
| after split | 26.18 | 30.71 | 23.96 | 29.76 | 18.62 | 21.15 | 19.89 | 19.65 | 9.22 | 5.12 | 18.34 | 27.12 | 20.81 |
In the paper:
Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through get_model?
https://github.com/embeddings-benchmark/mteb/blob/d2c704c15a6312264822be11986372cc1f7e6c6b/mteb/models/instruct_wrapper.py#L158-L171
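One quick way to sanity-check this could be to wrap `model.encode` and log which prompt would apply for each call (a rough sketch, not the code used here; it relies only on the `task_name`/`prompt_type` kwargs that mteb passes to encoders):

```python
import functools


def log_prompts(model, prompts_dict):
    """Hypothetical helper: print the prompt that prompts_dict maps to for
    every encode() call, based on the task_name/prompt_type kwargs."""
    original_encode = model.encode

    @functools.wraps(original_encode)
    def encode(sentences, **kwargs):
        task_name = kwargs.get("task_name")
        prompt_type = kwargs.get("prompt_type")
        print(f"{task_name} ({prompt_type}): {prompts_dict.get(task_name)!r}")
        return original_encode(sentences, **kwargs)

    model.encode = encode
    return model
```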
After adding code to print the instruction inside the wrapper, the following output was produced:
```
# Biology
Retrieval
  - BrightBiologyRetrieval, s2p
instruction: <|user|>
Given a Biology post, retrieve relevant passages that help answer the post<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:06<00:00, 15.80it/s]
instruction: <|embed|>
Batches:   0%| | 2/50000 [00:02<18:01:38, 1.30s/it]

# Psychology
Retrieval
  - BrightPsychologyRetrieval, s2p
instruction: <|user|>
Given a Psychology post, retrieve relevant passages that help answer the post<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:07<00:00, 14.12it/s]
instruction: <|embed|>
Batches:   0%| | 0/50000 [00:01<?, ?it/s]

# Aops
Retrieval
  - BrightAopsRetrieval, s2p
instruction: <|user|>
Given a Math problem, retrieve relevant examples that help answer the problem<|embed|>
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [00:06<00:00, 16.13it/s]
instruction: <|embed|>
Batches:   0%| | 17/50000 [00:09<7:16:33, 1.91it/s]
```
Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case.
I've checked the ReasonIR code and found some other places that could help with reproducing the results:
- In some cases, the rewritten query is concatenated with the original query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L82-L87
- Sometimes reasoning traces are added to the query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L124
- Maybe IDs can be filtered (ref https://github.com/embeddings-benchmark/mteb/issues/2696), but in the ReasonIR code they just check that no IDs intersect (a rough sketch of this idea follows below): https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L130-L131
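For illustration, a rough sketch of those two ID-handling options (names like `results`, `qrels`, and `excluded_ids` are illustrative; this is not the actual ReasonIR or mteb code):

```python
def drop_excluded(results: dict, excluded_ids: dict) -> dict:
    """Remove each query's excluded document ids from its retrieved results."""
    filtered = {}
    for query_id, doc_scores in results.items():
        excluded = set(excluded_ids.get(query_id, []))
        filtered[query_id] = {
            doc_id: score for doc_id, score in doc_scores.items() if doc_id not in excluded
        }
    return filtered


def assert_no_overlap(qrels: dict, excluded_ids: dict) -> None:
    """Sanity check: no gold document id should also be listed as excluded."""
    for query_id, gold in qrels.items():
        overlap = set(gold) & set(excluded_ids.get(query_id, []))
        assert not overlap, f"{query_id} has gold ids marked as excluded: {overlap}"
```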
@Muennighoff Can you help with what we can do to reproduce the results?
I think the IDs filtering is probably the main missing piece to fully reproduce results?
I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible in the single ReasonIR model seems to be related to the issue mentioned in point 3.
@Samoed
I think it would be better to close this PR and work on it later together with "Excluded IDs missing from BRIGHT dataset" (#2696). Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?
> I think it would be better to close this PR and work on it later together
Do you mean that you don't want the tasks in this PR and will add another PR for #2696?
> Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?
Yes, you need to add the descriptive statistics to merge. To apply the v2 format, you can select subsets from https://huggingface.co/datasets/mteb/BrightRetrieval, but the retrieval dataset loader requires the dataset to have strictly corpus, qrels, and queries, so maybe we need to reupload them instead.
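Roughly something like this for one domain (a sketch; the config names, split name, and column names are guesses based on the original xlangai/BRIGHT layout and may not match mteb/BrightRetrieval exactly):

```python
from datasets import load_dataset

# Assumed configs/splits/columns, mirroring xlangai/BRIGHT: "documents" and
# "examples" configs, one split per domain, with "id", "content", "query",
# and "gold_ids" columns.
corpus_ds = load_dataset("mteb/BrightRetrieval", "documents", split="biology")
examples = load_dataset("mteb/BrightRetrieval", "examples", split="biology")

corpus = {row["id"]: row["content"] for row in corpus_ds}
queries = {row["id"]: row["query"] for row in examples}
qrels = {row["id"]: {doc_id: 1 for doc_id in row["gold_ids"]} for row in examples}
```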
What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.
I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I don't think is a good solution.
In the current code, I see that even if we split into sub-tasks, as long as we use the dataset from xlangai/BRIGHT, we have to load the entire dataset required for all sub-tasks even when we want to evaluate just one sub-task. This seems inefficient.
Additionally, if we use the dataset from mteb/BrightRetrieval, it would be easier to create the v2 format and load each sub-task separately. However, since it does not contain information about excluded_ids, I think it would be difficult to resolve #2696.
What do you think about that?
I think it's better to reupload them, yes. But first we need to resolve the issues with the IDs.
This commit includes the excluded_ids handling implementation for testing purposes. I'll update with performance measurement results after running the benchmarks.
I think it's better to convert this task into reranking to select only the required IDs.
Are you suggesting that we handle it by including all documents in top_ranked except those corresponding to the excluded_ids? I think that if we map the corpus excluding the excluded_ids to each query when constructing top_ranked, it will cause too much memory waste.
> Are you suggesting that we handle it by including all documents in top_ranked except those corresponding to the excluded_ids?
Yes
> I think that if we map the corpus excluding the excluded_ids to each query when constructing top_ranked, it will cause too much memory waste.
Maybe, but I don't think we should add this parameter to evaluation. WDYT @KennethEnevoldsen?
I will try to run a performance evaluation according to your idea first.
It seems that reproducing the performance to some extent is definitely possible.
top_ranked has been added for all tasks, but since only BrightLeetcodeRetrieval, BrightPonyRetrieval, BrightAopsRetrieval, and BrightTheoremQAQuestionsRetrieval have excluded_ids, it seems that top_ranked only needs to be added for those tasks.
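For illustration, building top_ranked this way might look roughly like the following (a hypothetical helper, not the code in this PR; it also makes the memory concern above concrete, since every query carries an almost-full copy of the corpus id list):

```python
def build_top_ranked(
    corpus_ids: list[str],
    query_ids: list[str],
    excluded_ids: dict[str, list[str]],
) -> dict[str, list[str]]:
    """For each query, keep every corpus id except the ones excluded for it."""
    top_ranked = {}
    for query_id in query_ids:
        excluded = set(excluded_ids.get(query_id, []))
        top_ranked[query_id] = [doc_id for doc_id in corpus_ids if doc_id not in excluded]
    return top_ranked
```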
The performance was measured based on the following code:
```python
import logging

import torch

import mteb

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    max_seq_length=32768,
    prompts_dict=prompts_dict,
)

for task_name in prompts_dict.keys():
    tasks = mteb.get_tasks(tasks=[task_name])
    cache = mteb.cache.ResultCache("evaluation/cache")
    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder="evaluation/tests",
            encode_kwargs={"batch_size": 1},
        )
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        continue
```
The performance for the remaining tasks, excluding those where an OOM (Out of Memory) occurred, is as follows:
In the paper:
Great! So for now the most different task is Pony?
Among the tasks with excluded_ids, pony seems to be the most different. The other tasks seem to have reproduced the performance reported in the paper to some extent.
| task | Paper | PR | Diff |
|---|---|---|---|
| Aops | 14.7 | 15.6 | +0.9 |
| Biology | 26.2 | 26.1 | -0.1 |
| Economics | 23.3 | 24.0 | +0.7 |
| Pony | 10.5 | 9.3 | -1.2 |
| Robotics | 18.0 | 18.6 | +0.6 |
| StackOverflow | 23.9 | 21.1 | -2.8 |
| TheoremQAQuestion | 31.9 | 30.1 | -1.8 |
| TheoremQATheorem | 27.2 | 26.5 | -0.7 |
I think the main difference is because you've evaluated the version of the datasets with shots, but it's hard to tell how the tasks were produced for the paper's table. @Muennighoff Can you help with reproducing the scores?
Scores are looking really close, great work. Are you asking me whether in the paper they were evaluated with shots or without?
> were evaluated with shots or without?
Yes
Yeah I think those specific paper results are zero-shot
I set max_seq_length to 32768 based on the following reference. Is this correct?
https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/retrievers.py#L725-L726