I've made several attempts to reproduce the reported result for Qwen/Qwen2.5-Math-7B-Instruct on the College Math dataset, but I get ~41.8, well below the reported 46.8, even though I'm using the same parameter values as your evaluation code. Could you please advise what I have missed? Here's my command and its output:
!python3 -u evaluation/math_eval.py \
--model_name_or_path Qwen2.5-Math/models/Qwen2.5-Math-7B-Instruct \
--data_name "college_math" \
--data_dir Qwen2.5-Math/evaluation/data \
--output_dir Qwen2.5/Qwen2.5-Math/evaluation \
--split test \
--prompt_type "qwen25-math-cot" \
--seed 0 \
--temperature 0 \
--n_sampling 1 \
--top_p 1 \
--start 0 \
--end -1 \
--use_vllm \
--save_outputs \
--overwrite
==================================================
data: college_math ,remain samples: 2818
{'idx': 0, 'data_source': 'college_math.Beginning_and_Intermediate_Algebra', 'question_number': 'exercise.0.4.61', 'question': 'Simplify: $-10-4(n-5)$', 'answer': '$10-4 n$', 'license': 'Creative Commons Attribution 3.0 Unported License (CC BY 3.0)', 'data_topic': 'college_math.algebra'}
0% 0/2818 [00:00<?, ?it/s]<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
Simplify: $-10-4(n-5)$<|im_end|>
<|im_start|>assistant
100% 2818/2818 [00:11<00:00, 254.93it/s]
-------------------- Epoch 0
Processed prompts: 100% 2818/2818 [05:31<00:00, 8.50it/s, est. speed input: 587.02 toks/s, output: 5583.72 toks/s]
-------------------- Epoch 1
Unsolved samples: 0
Evaluate: 0% 0/2818 [00:00<?, ?it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 3% 97/2818 [00:03<01:18, 34.80it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 9% 259/2818 [00:05<00:52, 48.74it/s]:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 10% 269/2818 [00:05<00:49, 51.28it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 34% 963/2818 [00:18<00:18, 97.97it/s] :1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 42% 1185/2818 [00:25<01:06, 24.73it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 55% 1549/2818 [00:38<02:09, 9.78it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 58% 1631/2818 [00:42<01:07, 17.56it/s]
Evaluate: 58% 1643/2818 [00:47<04:01, 4.86it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 65% 1831/2818 [01:02<01:16, 12.93it/s]:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 97% 2721/2818 [02:16<00:22, 4.26it/s]:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 97% 2732/2818 [02:18<00:11, 7.43it/s]:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 100% 2818/2818 [02:31<00:00, 18.62it/s]
{'num_samples': 2818, 'num_scores': 2818, 'timeout_samples': 1, 'empty_samples': 0, 'acc': 41.8}
Saved to Qwen2.5/Qwen2.5-Math/evaluation/college_math/test_qwen25-math-cot_-1_seed0_t0.0_s0_e-1.jsonl
college_math avg
41.8 41.8
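A side note on the repeated SyntaxWarning lines in the log above: CPython 3.8 and later emits this message when it compiles source text in which a tuple or set literal is immediately followed by call-style parentheses. One plausible but unconfirmed explanation is that the evaluation toolkit parses some answer strings as Python expressions. The minimal sketch below only reproduces the warning itself; the input strings and the helper name `compile_warnings` are made up for illustration and are not taken from the evaluation code.

```python
import warnings


def compile_warnings(expr):
    """Compile `expr` as a Python expression and return any SyntaxWarning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        compile(expr, "<string>", "eval")  # compile only, never executed
    return [str(w.message) for w in caught if issubclass(w.category, SyntaxWarning)]


# Hypothetical strings that look like a tuple or set followed by a call.
tuple_msgs = compile_warnings("(1, 2)(3, 4)")
set_msgs = compile_warnings("{1, 2}(3)")

print(tuple_msgs)
print(set_msgs)
```

If answers shaped like these reach a Python-based parser, the warning itself is harmless, but whether such answers end up scored correctly is worth checking separately.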
You can use the scripts here to reproduce the results (adapted from the Qwen evaluation code). They also support majority voting and integration of a process reward model.
Our result for Qwen2.5-Math-7B-Instruct with greedy decoding: 47.1 (our previous run gave 46.9, so the run-to-run variance appears to be small).
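As a rough sanity check on the variance remark, the accuracy figures quoted in this thread can be converted into problem counts on the 2818-sample test set. This is only back-of-the-envelope arithmetic on the numbers above; the helper name `points_to_problems` is hypothetical.

```python
NUM_SAMPLES = 2818  # "remain samples" reported in the log


def points_to_problems(acc_a, acc_b, n=NUM_SAMPLES):
    """Convert a difference in accuracy (percentage points) into a problem count."""
    return round(abs(acc_a - acc_b) / 100 * n)


gap_vs_reported = points_to_problems(46.8, 41.8)   # reported score vs. the 41.8 run
gap_vs_rerun = points_to_problems(47.1, 41.8)      # reproduction above vs. the 41.8 run
run_to_run = points_to_problems(47.1, 46.9)        # two reproduction runs

print(gap_vs_reported, gap_vs_rerun, run_to_run)
```

The gap to the 41.8 run corresponds to well over a hundred problems, while the two reproduction runs differ by only a handful, so ordinary run-to-run variation seems unlikely to explain the discrepancy.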
The attached file may also help you pinpoint the source of the errors more accurately: whether they stem from the LLM's generation or from the evaluation toolkit.
response.jsonl.zip
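One way to use the attached responses is to diff them against your own output file, problem by problem. The sketch below assumes each JSONL record carries an `idx` field and a `score` field that is either a boolean or a list with one boolean per sample; these field names, and the file names in the usage comment, are assumptions and should be adjusted to whatever the files actually contain.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def is_correct(record, score_key="score"):
    """Interpret the (assumed) score field as a single boolean."""
    score = record.get(score_key)
    if isinstance(score, list):
        return bool(score) and bool(score[0])  # first sample only (greedy)
    return bool(score)


def disagreements(mine, theirs, id_key="idx"):
    """Return ids of problems that the two runs score differently."""
    theirs_by_id = {r[id_key]: r for r in theirs}
    differing = []
    for record in mine:
        other = theirs_by_id.get(record[id_key])
        if other is not None and is_correct(record) != is_correct(other):
            differing.append(record[id_key])
    return differing


# Usage (hypothetical file names):
#   mine = load_jsonl("my_run.jsonl")
#   theirs = load_jsonl("response.jsonl")
#   print(disagreements(mine, theirs))
```

For each id this returns, comparing the generated text in both files shows where the difference arises: identical generations with different scores point to the evaluation toolkit, while differing generations point to the inference setup.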