Results on `BRIGHT` not matching
I ran ReasonIR-8B on the BRIGHT benchmark using the following code:
```python
import torch
import mteb

# Task-specific instruction prepended to each query at encode time.
prompts_dict = {
    "BrightRetrieval": "Given a Post, retrieve relevant passages that help answer the post",
}

tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

# Load ReasonIR-8B in bfloat16 with the prompt above.
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="results",
    encode_kwargs={"batch_size": 1},
)
```
The results are as follows:
| Model | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReasonIR | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
These results do not match the numbers reported in the paper.
Originally posted by @whybe-choi in https://github.com/embeddings-benchmark/mteb/issues/3221#issuecomment-3355490399
A possible solution would be to create a separate task for each subset.
Hello @Samoed! I'd like to help resolve this issue if possible. Is it okay if I start working on it? Any advice or guidance you could provide would be greatly appreciated.
Hi! Yes, that would be great! For this task, you should create a separate task for each subset.
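Something along these lines (a rough sketch only: the import path and class name are assumptions, and each class would still need its full `TaskMetadata`):

```python
# Sketch only: import path, class name, and metadata handling are assumptions.
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval

# The 12 BRIGHT subsets that would each become a standalone task
# (matching the columns in the results table above).
BRIGHT_SUBSETS = [
    "biology", "earth_science", "economics", "psychology", "robotics",
    "stackoverflow", "sustainable_living", "leetcode", "pony", "aops",
    "theoremqa_questions", "theoremqa_theorems",
]

class BrightBiologyRetrieval(AbsTaskRetrieval):
    """Standalone task for the biology subset, so it can be selected,
    run, and reported independently of the other subsets."""
    # metadata = TaskMetadata(name="BrightBiologyRetrieval", ...)
```

With one class per subset, individual subsets can be selected through `mteb.get_tasks`, and the per-subset scores line up one-to-one with the columns of the paper's table.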
In addition to splitting the tasks by subset, it would be a good idea to move the long-document variants into a separate file called BrightLongRetrieval.py, along the lines of the skeleton below. What do you think about this approach?
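A hypothetical skeleton of the file (class names are placeholders, and each class would still need its full metadata):

```python
# BrightLongRetrieval.py -- long-document variants kept separate from the
# standard-length tasks. Skeleton only; names and base class are assumptions.
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval

class BrightBiologyLongRetrieval(AbsTaskRetrieval):
    """Long-document variant of the biology subset (metadata omitted)."""
```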
Yes, I think this is a good approach.