HumanEval benchmark producing unreasonably low scores
Hello, LLMC team!
I was glad to discover that you added support for the HumanEval benchmark. However, when I tried to produce sanity-check results for unquantized models, the scores came out significantly lower than the officially reported ones.
The models I tried to evaluate were:
- Qwen/Qwen2-1.5B-Instruct (official HumanEval score: 37.8%, in the LLMC repo: ~4%)
- Qwen/Qwen2.5-Coder-1.5B-Instruct (official score: 70.7%, in the LLMC repo: ~15%)
The evaluation configuration I was using:
eval:
    eval_pos: [pretrain]
    type: code
    name: human_eval
    download: False
    seq_len: 2048 # unsure if it has any effect at all...
    format_tabs: True
    res_path: <results_path>
    bs: 1
    inference_per_block: False
I also tried to play with additional parameters such as add_chat_temp and the generation parameters temperature, top_p, and do_sample, but to no avail.
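For reference, here is roughly what I assumed add_chat_temp would do: wrap the raw HumanEval completion prompt in the model's chat template before generation. This is only my own sketch using transformers' apply_chat_template, not the LLMC code, and the prompt text is illustrative:

```python
from transformers import AutoTokenizer

# Sketch only: wrap a raw HumanEval-style prompt in the instruct model's chat
# template before generation (my assumption of what add_chat_temp is meant to do).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")

raw_prompt = (
    "from typing import List\n\n"
    "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
    "    \"\"\"Check if any two numbers in the list are closer than the threshold.\"\"\"\n"
)

messages = [{"role": "user",
             "content": "Complete the following Python function:\n\n" + raw_prompt}]

chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat_prompt)  # this, rather than raw_prompt, is what I expect the instruct model to see
```

If add_chat_temp is supposed to produce something equivalent to this, please let me know whether I am using it correctly.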
Am I missing something, or is there a problem with the test implementation?
Thanks in advance!
Is it caused by the prompt? https://github.com/ModelTC/llmc/blob/bc9367fb8088e9040cc3d20c8ce7e44c32d95e8c/llmc/eval/eval_code.py#L20C8-L20C9
@gushiqiao , Thank you for your response. Can you please share a configuration for some model that yielded HumanEval results that were consistent with the officially published ones?
I'm a little lost with the prompting here.
I tried to dig into the official evaluation code of Qwen2.5-Coder and copy their prompt, but it resulted in the Qwen2.5-Coder-1.5B-Instruct model outputting only a single line of code before emitting the EOS token...
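In case it helps to pin down where the mismatch is: my guess is that an instruct model answers with a complete function inside a fenced code block instead of continuing the raw prompt, so the reply needs some extraction before the HumanEval tests are run. This is just a sketch of the post-processing I would expect to need, not what eval_code.py actually does:

```python
import re

def extract_code(model_output: str) -> str:
    """Pull the Python code out of a chat-style reply.

    Instruct models usually return a full function inside a fenced code
    block rather than a bare continuation of the prompt, so the fenced
    body (if present) is taken as the completion. Sketch only.
    """
    match = re.search(r"`{3}(?:python)?\s*\n(.*?)`{3}", model_output, re.DOTALL)
    if match:
        return match.group(1)
    # Fall back to the raw output if no fenced block is found.
    return model_output
```

If the harness instead concatenates the chat reply onto the original prompt, that could explain scores in the single digits even when the generated code itself is fine.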
Any ideas from your side?