
Performance Report of Qwen2.5-Math-7B-Instruct on GaoKao Dataset

Open xiaobanni opened this issue 1 year ago • 0 comments

I used the default command:

PROMPT_TYPE="qwen25-math-cot"
MODEL_NAME_OR_PATH="Qwen/Qwen2.5-Math-7B-Instruct"
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH

With the following default setup:

DATA_NAME="gaokao2024_I,gaokao2024_II,gaokao2024_mix,gaokao_math_cloze,gaokao_math_qa"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1 \
    --use_vllm \
    --save_outputs \
    --overwrite \
    --adapt_few_shot

While attempting to reproduce the reported results on the GaoKao datasets, I obtained the following scores:

gaokao_math_cloze

{
    "num_samples": 118,
    "num_scores": 118,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 68.6,
    "time_use_in_second": 33.38,
    "time_use_in_minute": "0:33"
}

gaokao_math_qa

{
    "num_samples": 351,
    "num_scores": 351,
    "timeout_samples": 0,
    "empty_samples": 2,
    "acc": 57.0,
    "time_use_in_second": 126.39,
    "time_use_in_minute": "2:06"
}

gaokao2024_I

{
    "num_samples": 14,
    "num_scores": 14,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 50.0,
    "type_acc": {
        "blank": 33.3,
        "multi": 0.0,
        "single": 75.0
    },
    "time_use_in_second": 23.06,
    "time_use_in_minute": "0:23"
}

gaokao2024_II

{
    "num_samples": 14,
    "num_scores": 14,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 57.1,
    "type_acc": {
        "blank": 33.3,
        "multi": 100.0,
        "single": 50.0
    },
    "time_use_in_second": 26.19,
    "time_use_in_minute": "0:26"
}

gaokao2024_mix

{
    "num_samples": 91,
    "num_scores": 91,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 59.3,
    "time_use_in_second": 39.97,
    "time_use_in_minute": "0:39"
}

The final sample-weighted average accuracy comes out to about 59.5%, a significant gap from the 66.3% reported in the paper. Could you help me identify any potential issues?

Code to calculate the average accuracy:

# Per-dataset results copied from the JSON summaries above
data = [
    {"num_samples": 118, "acc": 68.6},  # gaokao_math_cloze
    {"num_samples": 351, "acc": 57.0},  # gaokao_math_qa
    {"num_samples": 14, "acc": 50.0},   # gaokao2024_I
    {"num_samples": 14, "acc": 57.1},   # gaokao2024_II
    {"num_samples": 91, "acc": 59.3},   # gaokao2024_mix
]

# Sample-weighted (micro) average: each sample counts equally
total_samples = sum(item["num_samples"] for item in data)
weighted_acc = sum(item["num_samples"] * item["acc"] for item in data) / total_samples

print(total_samples, round(weighted_acc, 1))
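For comparison, I also tried an unweighted (macro) average, in case the paper averages the subsets equally rather than by sample count (the averaging convention used in the paper is an assumption on my part):

# Per-subset accuracies from the runs above
accs = [68.6, 57.0, 50.0, 57.1, 59.3]

# Unweighted (macro) average: each subset counts equally,
# regardless of how many samples it contains
macro_acc = sum(accs) / len(accs)
print(round(macro_acc, 1))  # 58.4

Either way, both averages fall well short of 66.3%, so the gap does not come from the choice of averaging.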

xiaobanni • Nov 01 '24 08:11