lm-evaluation-harness
Inconsistent evaluation results with Chat Template
I evaluated llama3-8b-Instruct on the gsm8k benchmark and noticed some interesting phenomena.
- The HuggingFace and vLLM backends give similar results.
- If I start an API service with vLLM and evaluate gsm8k through lm-eval's local-chat mode, I get a different accuracy.
- I browsed the lm-eval source code and found that the API path applies the chat template, while the huggingface and vllm backends do not (i.e. their prompts contain no special tokens).
- I tried applying the tokenizer's chat template myself and logged some intermediate results (see the sketch after this list). Could you give me some insight into this?
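Roughly, the comparison of the two prompt formats looks like the sketch below (simplified; the model id and example question are just placeholders, and the system-prompt entry can be dropped to mimic the "no system prompt" run):

```python
# Inspect the prompt with and without the Llama-3 chat template.
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained(model_id)

question = "Janet has 3 apples and buys 2 more. How many apples does she have?"

# Plain prompt, as the hf/vllm backends send it (no special chat tokens).
plain_prompt = f"Question: {question}\nAnswer:"

# Chat-templated prompt, as a chat-completions API would build it.
messages = [
    # Optional system prompt; remove this entry for the "no system prompt" case.
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": plain_prompt},
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print("--- plain ---")
print(plain_prompt)
print("--- chat template ---")
print(chat_prompt)
```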
Output of vLLM in lm-eval:
Output of vLLM using the Llama chat template, without a system prompt:
Output of vLLM using the Llama chat template, with the system prompt "you are a helpful assistant":
In the end, the accuracy of vLLM with the chat template drops significantly!
Do you have any idea why this happens?
I think people should use chat templates more in evaluation, since that is closer to real-world usage scenarios.
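For anyone who wants to reproduce the system-prompt comparison outside of lm-eval, a minimal sketch against the vLLM OpenAI-compatible endpoint could look like this (the base URL, port, and model name are assumptions and need to match the local server):

```python
# Query a locally running vLLM OpenAI-compatible server with and without a system prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed address

question = "Janet has 3 apples and buys 2 more. How many apples does she have?"

def ask(with_system_prompt: bool) -> str:
    messages = []
    if with_system_prompt:
        messages.append({"role": "system", "content": "you are a helpful assistant"})
    messages.append({"role": "user", "content": f"Question: {question}\nAnswer:"})
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
        messages=messages,
        temperature=0.0,
        max_tokens=256,
    )
    return resp.choices[0].message.content

print(ask(with_system_prompt=False))
print(ask(with_system_prompt=True))
```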