Qwen2.5-Math
Performance Report of Qwen2.5-Math-7B-Instruct on GaoKao Dataset
I used the default command:
PROMPT_TYPE="qwen25-math-cot"
MODEL_NAME_OR_PATH="Qwen/Qwen2.5-Math-7B-Instruct"
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH
With the following default setup:
DATA_NAME="gaokao2024_I,gaokao2024_II,gaokao2024_mix,gaokao_math_cloze,gaokao_math_qa"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--output_dir ${OUTPUT_DIR} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--seed 0 \
--temperature 0 \
--n_sampling 1 \
--top_p 1 \
--start 0 \
--end -1 \
--use_vllm \
--save_outputs \
--overwrite \
--adapt_few_shot
When I tried to reproduce the results on the GaoKao dataset, I got the following per-subset metrics:
gaokao_math_cloze
{
"num_samples": 118,
"num_scores": 118,
"timeout_samples": 0,
"empty_samples": 0,
"acc": 68.6,
"time_use_in_second": 33.38,
"time_use_in_minute": "0:33"
}
gaokao_math_qa
{
"num_samples": 351,
"num_scores": 351,
"timeout_samples": 0,
"empty_samples": 2,
"acc": 57.0,
"time_use_in_second": 126.39,
"time_use_in_minute": "2:06"
}
gaokao2024_I
{
"num_samples": 14,
"num_scores": 14,
"timeout_samples": 0,
"empty_samples": 0,
"acc": 50.0,
"type_acc": {
"blank": 33.3,
"multi": 0.0,
"single": 75.0
},
"time_use_in_second": 23.06,
"time_use_in_minute": "0:23"
}
gaokao2024_II
{
"num_samples": 14,
"num_scores": 14,
"timeout_samples": 0,
"empty_samples": 0,
"acc": 57.1,
"type_acc": {
"blank": 33.3,
"multi": 100.0,
"single": 50.0
},
"time_use_in_second": 26.19,
"time_use_in_minute": "0:26"
}
gaokao2024_mix
{
"num_samples": 91,
"num_scores": 91,
"timeout_samples": 0,
"empty_samples": 0,
"acc": 59.3,
"time_use_in_second": 39.97,
"time_use_in_minute": "0:39"
}
The sample-weighted average accuracy across these five subsets comes to about 59.5%, well below the 66.3% reported in the paper. Could you help me identify what might be causing the gap?
Code to calculate average accuracy:
# Defining the provided data to calculate the total `num_samples` and combined `acc`
data = [
{"num_samples": 118, "acc": 68.6},
{"num_samples": 351, "acc": 57.0},
{"num_samples": 14, "acc": 50.0},
{"num_samples": 14, "acc": 57.1},
{"num_samples": 91, "acc": 59.3},
]
# Calculating total `num_samples` and weighted `acc`
total_samples = sum(item["num_samples"] for item in data)
weighted_acc = sum(item["num_samples"] * item["acc"] for item in data) / total_samples
# Print the results explicitly so the snippet also works outside a notebook
print(total_samples, round(weighted_acc, 2))
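Since the gap may partly depend on how the per-subset scores are aggregated, here is a minimal sketch that compares two common conventions on the numbers above: a sample-weighted (micro) mean and an unweighted per-subset (macro) mean. The helper names `micro_average` and `macro_average` are illustrative only, and which convention (and which subsets) the paper uses is an assumption to verify, not something confirmed here.

```python
# Per-subset results copied from the report above: (name, num_samples, acc).
results = [
    ("gaokao_math_cloze", 118, 68.6),
    ("gaokao_math_qa", 351, 57.0),
    ("gaokao2024_I", 14, 50.0),
    ("gaokao2024_II", 14, 57.1),
    ("gaokao2024_mix", 91, 59.3),
]


def micro_average(rows):
    """Sample-weighted mean: every question counts equally."""
    total = sum(n for _, n, _ in rows)
    return sum(n * acc for _, n, acc in rows) / total


def macro_average(rows):
    """Unweighted mean: every subset counts equally (hypothetical alternative)."""
    return sum(acc for _, _, acc in rows) / len(rows)


print(f"micro (sample-weighted): {micro_average(results):.2f}")  # 59.52
print(f"macro (per-subset mean): {macro_average(results):.2f}")  # 58.40
```

Both conventions land well under 66.3 on these numbers, so the aggregation method alone does not seem to account for the difference.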