Results on `BRIGHT` not matching
I ran ReasonIR-8B on the BRIGHT benchmark using the following code:
```python
import torch
import mteb

# Task-specific instruction prepended to each query at encode time.
prompts_dict = {
    "BrightRetrieval": "Given a Post, retrieve relevant passages that help answer the post",
}

tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

# Load ReasonIR-8B in bfloat16 with the prompt above.
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="results",
    encode_kwargs={"batch_size": 1},
)
```
The results are as follows:
| Model | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReasonIR | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
These results do not match the numbers reported in the paper.
Originally posted by @whybe-choi in https://github.com/embeddings-benchmark/mteb/issues/3221#issuecomment-3355490399
A possible solution would be to create a separate task for each subset.
Hello @Samoed! I'd like to help resolve this issue if possible. Is it okay if I start working on it? Any advice or guidance you could provide would be greatly appreciated.
Hi! Yes, that would be great! For this task, you should create a separate task for each subset.
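Something along these lines (a rough sketch only: the import path and class name are assumptions, and each class would still need its full `TaskMetadata`):

```python
# Sketch only: import path, class name, and metadata handling are assumptions.
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval

# The 12 BRIGHT subsets that would each become a standalone task
# (matching the columns in the results table above).
BRIGHT_SUBSETS = [
    "biology", "earth_science", "economics", "psychology", "robotics",
    "stackoverflow", "sustainable_living", "leetcode", "pony", "aops",
    "theoremqa_questions", "theoremqa_theorems",
]

class BrightBiologyRetrieval(AbsTaskRetrieval):
    """Standalone task for the biology subset, so it can be selected,
    run, and reported independently of the other subsets."""
    # metadata = TaskMetadata(name="BrightBiologyRetrieval", ...)
```

With one class per subset, individual subsets can be selected through `mteb.get_tasks`, and the per-subset scores line up one-to-one with the columns of the paper's table.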
In addition to splitting the tasks by subset, it would be a good idea to move the long-document variants into a separate file called BrightLongRetrieval.py, along the lines of the skeleton below. What do you think about this approach?
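A hypothetical skeleton of the file (class names are placeholders, and each class would still need its full metadata):

```python
# BrightLongRetrieval.py -- long-document variants kept separate from the
# standard-length tasks. Skeleton only; names and base class are assumptions.
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval

class BrightBiologyLongRetrieval(AbsTaskRetrieval):
    """Long-document variant of the biology subset (metadata omitted)."""
```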
Yes, I think this is a good approach.