Unable to replicate the results of Qwen2.5-Math-7B-Instruct

Open nurmanmus opened this issue 9 months ago • 1 comment

I've made several attempts to replicate the results of Qwen/Qwen2.5-Math-7B-Instruct on the College Math dataset, but I get ~41.8, which is far below the reported 46.8, even though I use the same parameter values as your evaluation code. Could you please advise on what I might have missed? Here's my command and its output:

!python3 -u evaluation/math_eval.py \
    --model_name_or_path Qwen2.5-Math/models/Qwen2.5-Math-7B-Instruct \
    --data_name "college_math" \
    --data_dir Qwen2.5-Math/evaluation/data \
    --output_dir Qwen2.5/Qwen2.5-Math/evaluation \
    --split test \
    --prompt_type "qwen25-math-cot" \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1 \
    --use_vllm \
    --save_outputs \
    --overwrite

==================================================
data: college_math ,remain samples: 2818
{'idx': 0, 'data_source': 'college_math.Beginning_and_Intermediate_Algebra', 'question_number': 'exercise.0.4.61', 'question': 'Simplify: $-10-4(n-5)$', 'answer': '$10-4 n$', 'license': 'Creative Commons Attribution 3.0 Unported License (CC BY 3.0)', 'data_topic': 'college_math.algebra'}
0% 0/2818 [00:00<?, ?it/s]
<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
Simplify: $-10-4(n-5)$<|im_end|>
<|im_start|>assistant

100% 2818/2818 [00:11<00:00, 254.93it/s]
-------------------- Epoch 0
Processed prompts: 100% 2818/2818 [05:31<00:00, 8.50it/s, est. speed input: 587.02 toks/s, output: 5583.72 toks/s]
-------------------- Epoch 1
Unsolved samples: 0
Evaluate: 0% 0/2818 [00:00<?, ?it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 3% 97/2818 [00:03<01:18, 34.80it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 9% 259/2818 [00:05<00:52, 48.74it/s]
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 10% 269/2818 [00:05<00:49, 51.28it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 34% 963/2818 [00:18<00:18, 97.97it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 42% 1185/2818 [00:25<01:06, 24.73it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 55% 1549/2818 [00:38<02:09, 9.78it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 58% 1631/2818 [00:42<01:07, 17.56it/s]
Evaluate: 58% 1643/2818 [00:47<04:01, 4.86it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 65% 1831/2818 [01:02<01:16, 12.93it/s]
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'tuple' object is not callable; perhaps you missed a comma?
Evaluate: 97% 2721/2818 [02:16<00:22, 4.26it/s]
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 97% 2732/2818 [02:18<00:11, 7.43it/s]
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'set' object is not callable; perhaps you missed a comma?
Evaluate: 100% 2818/2818 [02:31<00:00, 18.62it/s]
{'num_samples': 2818, 'num_scores': 2818, 'timeout_samples': 1, 'empty_samples': 0, 'acc': 41.8}
Saved to Qwen2.5/Qwen2.5-Math/evaluation/college_math/test_qwen25-math-cot_-1_seed0_t0.0_s0_e-1.jsonl
college_math avg
41.8 41.8

Apr 01 '25 14:04 nurmanmus

You can use the scripts here to reproduce the results (adapted from the Qwen eval code). They also support majority voting and integration of a process reward model.

Our results for Qwen2.5-Math-7B-Instruct with greedy decoding: 47.1 (our previous run gave 46.9, so the run-to-run variance appears to be small).
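To put these numbers side by side, here is a rough sketch that converts each accuracy quoted in this thread into an approximate count of correct answers, using the `num_samples` value (2818) from the log above. The helper name `approx_correct` is made up for illustration, and because the reported accuracies are rounded to one decimal, the counts are only approximate.

```python
# Number of test questions, taken from 'num_samples' in the log in this thread.
NUM_SAMPLES = 2818


def approx_correct(acc_percent: float, n: int = NUM_SAMPLES) -> int:
    """Approximate number of correct answers implied by a rounded accuracy (in %)."""
    return round(acc_percent / 100 * n)


# Accuracies quoted in this thread (all greedy decoding on college_math).
runs = {
    "issue reporter": 41.8,
    "reported reference": 46.8,
    "reply, previous run": 46.9,
    "reply, latest run": 47.1,
}

counts = {name: approx_correct(acc) for name, acc in runs.items()}

for name, count in counts.items():
    print(f"{name:>20}: ~{count} / {NUM_SAMPLES} correct")

# Gap between the two runs from the reply, in questions.
rerun_gap = counts["reply, latest run"] - counts["reply, previous run"]
# Gap between the reply's latest run and the issue reporter's run, in questions.
repro_gap = counts["reply, latest run"] - counts["issue reporter"]
print(f"run-to-run gap: ~{rerun_gap} questions; reproduction gap: ~{repro_gap} questions")
```

On these figures, the two runs in the reply differ by only a handful of questions, while the run in the issue is off by well over a hundred, so the gap is unlikely to be ordinary run-to-run noise.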

The attached file may also help you pinpoint the source of the errors more accurately: whether they stem from the LLM's generation or from the evaluation toolkit.

response.jsonl.zip
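One way to use the attached file is to compare it sample by sample against your own saved outputs. The sketch below is only an illustration: it assumes both files are JSONL, that each record carries an identifier under `idx` and a correctness flag under `score` (a bool, or a list with one bool per sampled generation), and these key names may differ in either file. The function names `load_scores` and `disagreements` are hypothetical helpers, not part of any toolkit.

```python
import json
import sys


def load_scores(path, id_key="idx", score_key="score"):
    """Map each sample id to True/False (correct or not).

    Assumption: `id_key` and `score_key` exist in every record; adjust them
    to whatever the files actually contain.
    """
    scores = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            score = record[score_key]
            # Assumption: a list holds one flag per sampled generation;
            # with a single greedy sample we look at the first entry only.
            if isinstance(score, list):
                score = score[0] if score else False
            scores[record[id_key]] = bool(score)
    return scores


def disagreements(mine, reference):
    """Return ids (present in both files) where the two runs disagree."""
    shared = mine.keys() & reference.keys()
    return {
        "only_reference_correct": sorted(i for i in shared if reference[i] and not mine[i]),
        "only_mine_correct": sorted(i for i in shared if mine[i] and not reference[i]),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_runs.py <my_outputs.jsonl> <reference_response.jsonl>
    diff = disagreements(load_scores(sys.argv[1]), load_scores(sys.argv[2]))
    for label, ids in diff.items():
        print(f"{label}: {len(ids)} samples, first few ids: {ids[:10]}")
```

For the ids where only the reference run is correct, comparing the generated text itself is the next step: if the text is identical but the scores differ, the evaluation toolkit is the more likely cause, otherwise the difference probably comes from generation.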

Apr 16 '25 14:04 yyDing1