
HumanEval benchmark producing unreasonably low scores

Open sasha-hailo opened this issue 10 months ago • 2 comments

Hello, LLMC team! I was glad to discover that you added support for the HumanEval benchmark. However, when I tried to produce sanity-check results for unquantized models, the scores were significantly lower than the official ones.

The models I tried to evaluate were:

  • Qwen/Qwen2-1.5B-Instruct (official HumanEval score 37.8%, in LLMC repo: ~4%)
  • Qwen/Qwen2.5-Coder-1.5B-Instruct (official score 70.7%, in LLMC repo: ~15%).

The evaluation configuration I was using:

eval:
    eval_pos: [pretrain]
    type: code
    name: human_eval
    download: False
    seq_len: 2048  # unsure if it has effect at all...
    format_tabs: True
    res_path: <results_path>
    bs: 1
    inference_per_block: False

I also tried playing with additional parameters, such as add_chat_temp and the generation parameters temperature, top_p, and do_sample, but to no avail.
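
For reference, the official numbers I'm comparing against come from the standard HumanEval harness. A minimal sanity check outside of llmc would look roughly like the sketch below (it uses the openai/human-eval package with greedy decoding; it is not llmc code, and the decoding settings are just examples):

# Sketch: generate completions for the official human-eval harness
# (https://github.com/openai/human-eval); decoding settings are examples only.
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def complete(prompt: str) -> str:
    # Plain completion-style prompting; instruct models may need a chat
    # template instead (see the prompt discussion further down this thread).
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

problems = read_problems()
samples = [{"task_id": tid, "completion": complete(p["prompt"])}
           for tid, p in problems.items()]
write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl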

Am I missing something, or is there a problem with the test implementation?

Thanks in advance!

sasha-hailo avatar Feb 23 '25 09:02 sasha-hailo

Is it caused by the prompt? https://github.com/ModelTC/llmc/blob/bc9367fb8088e9040cc3d20c8ce7e44c32d95e8c/llmc/eval/eval_code.py#L20C8-L20C9

gushiqiao avatar Mar 10 '25 09:03 gushiqiao

@gushiqiao, thank you for your response. Could you please share a configuration for a model that yielded HumanEval results consistent with the officially published ones?

I'm a little lost with the prompting here. I tried digging into the official evaluation code of Qwen2.5-Coder and copying their prompt, but that resulted in the Qwen2.5-Coder-1.5B-Instruct model outputting only a single line of code before emitting the EOS token... Any ideas from your side?
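
For concreteness, what I'm attempting looks roughly like the sketch below (my own guess at chat-template prompting and code extraction; it is neither llmc's eval_code.py nor the official Qwen evaluation code):

# Sketch: chat-template prompting of an instruct model on a HumanEval task.
# The instruction wording and the extraction regex are my own assumptions.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def generate_solution(humaneval_prompt: str) -> str:
    messages = [{"role": "user",
                 "content": "Complete the following Python function:\n\n" + humaneval_prompt}]
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Instruct models usually answer with a fenced markdown block rather than a
    # raw continuation of the prompt, so the code has to be extracted first.
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply

# Note: the extracted text is typically a complete function definition, not a
# continuation of the prompt, which matters for how the completion is
# concatenated and scored.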

sasha-hailo avatar Mar 12 '25 16:03 sasha-hailo