
When evaluating the BGE-Code-v1 model using the CoIR dataset, why is the result in the Apps section so poor?

Open · kartikzheng opened this issue 7 months ago · 1 comment

When evaluating the BGE-Code-v1 model on the CoIR benchmark, why is the result for the Apps subset so poor, only around 20? Below is the main configuration. The results are the same whether I use the official CoIR library or the evalscope library. Is there anything wrong with it?

from evalscope.run import run_task

one_stage_task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "bge-code-v1",
                "pooling_mode": "lasttoken",		
                "max_seq_length": 512,
                "prompt": "<instruct>Given a code contest problem description, retrieve relevant code that can help solve the problem.\n<query>",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": [
                "AppsRetrieval",
            ],
            "verbosity": 2,
            "overwrite_results": True,
            "top_k": 10,
        },
    },
}

run_task(task_cfg=one_stage_task_cfg) 

kartikzheng · May 25, 2025

We haven't used evalscope for evaluation; we evaluated with the official code from the CoIR GitHub repository. For details, see: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder#coir
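
For reference, a standalone run with the CoIR package generally follows the pattern below. This is only a minimal sketch based on the usage shown in the CoIR README; the exact module and class names (get_tasks, COIR, YourCustomDEModel) and the task identifier "apps" may differ between versions, and the model path is a placeholder you would replace with your local bge-code-v1 checkpoint.

from coir.data_loader import get_tasks
from coir.evaluation import COIR
from coir.models import YourCustomDEModel

# Wrap the embedding model (placeholder name; point this at bge-code-v1)
model = YourCustomDEModel(model_name="BAAI/bge-code-v1")

# Load only the Apps retrieval task from the CoIR benchmark
tasks = get_tasks(tasks=["apps"])

# Run the evaluation and write per-task scores to the output folder
evaluation = COIR(tasks=tasks, batch_size=128)
results = evaluation.run(model, output_folder="results/bge-code-v1")
print(results)

Comparing the scores from this path against the evalscope run with identical pooling, prompt, and sequence-length settings should make it easier to isolate whether the gap comes from the evaluation harness or the model configuration.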

545999961 · May 28, 2025