Results for Qwen2.5-Math-1.5B from the evaluation code differ significantly from the reported results
Hi, is there a dedicated prompt for evaluating the base models? When I directly run the instruct-model evaluation code on Qwen2.5-Math-1.5B, the results differ quite a lot from those in the report.
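To be concrete, I suspect the mismatch comes from the prompt: the instruct-style pipeline wraps each question with the chat template, whereas I would expect a base model to need a plain CoT completion prompt. A simplified illustration of the two (the question and the base-prompt wording are just placeholders, not the actual evaluation prompt):

```python
from transformers import AutoTokenizer

question = "Find x if 2x + 3 = 11."

# Prompt as built by the instruct-style evaluation path (chat template).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B-Instruct")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# What I would expect a base model to be fed instead: a plain CoT
# completion prompt with no chat markup (illustrative wording).
base_prompt = f"Question: {question}\nAnswer: Let's think step by step."

print(chat_prompt)
print(base_prompt)
```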
Hi, have you solved this problem? I'm running into a similar issue.
I fixed this by using https://github.com/ZubinGou/math-evaluation-harness, which is one of the foundations of this repo.
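In case it helps, the rough shape of what that harness does for a base model is something like the following. This is my own simplified sketch, not the harness's actual code; the few-shot prompt, regex, and generation settings here are illustrative:

```python
import re
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Math-1.5B")

# Plain few-shot CoT completion prompt (no chat template), as I
# understand is selected for base models via the harness's prompt type.
prompt = (
    "Question: What is 7 * 8?\n"
    "Answer: Let's think step by step. 7 * 8 = 56. The answer is \\boxed{56}.\n\n"
    "Question: Find x if 2x + 3 = 11.\n"
    "Answer: Let's think step by step."
)

params = SamplingParams(temperature=0.0, max_tokens=1024, stop=["\n\nQuestion:"])
output = llm.generate([prompt], params)[0].outputs[0].text

# Pull the final \boxed{...} answer and compare it with the reference.
matches = re.findall(r"\\boxed\{([^}]*)\}", output)
pred = matches[-1] if matches else None
print(pred == "4")
```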
Hi, sorry to bother you. Would you mind sharing an evaluation configuration for the Qwen2.5-Math base models, such as top_k and temperature?
Hi, sorry, I haven't looked into it that deeply. You could check the Qwen2.5-Math report for the configuration, if they provide it. I evaluate models with the default config of https://github.com/ZubinGou/math-evaluation-harness.
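For reference, "default config" on my side effectively means greedy decoding, which makes top_k irrelevant. In vLLM terms that would look roughly like this (illustrative values, not an official Qwen2.5-Math setting):

```python
from vllm import SamplingParams

# Greedy decoding for base-model evaluation: temperature 0 means
# top_k / top_p have no effect. max_tokens is an illustrative value.
params = SamplingParams(temperature=0.0, top_k=-1, top_p=1.0, max_tokens=2048)
```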
For the Qwen2.5-Math base model, do the results generated by this repo match the scores provided in the paper?
Yes, with reasonable differences.