lm-evaluation-harness

Inconsistent evaluation results with Chat Template

Open shiweijiezero opened this issue 9 months ago • 4 comments

I evaluated llama3-8b-Instruct on the gsm8k benchmark and found some interesting phenomena.

  1. Huggingface and vllm give similar results. (screenshot)

  2. If I start an API service with vllm and use lm-eval's local-chat mode to evaluate gsm8k, the resulting accuracy is different. (screenshot)

  3. I browsed the source code of lm-eval and found that the API mode applies the chat template, while inference in the huggingface and vllm modes does not use a chat template (i.e. there are no special tokens).

  4. I tried to apply the chat template via the tokenizer (see the sketch below) and logged some intermediate results; could you give me some insight into this?
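
For reference, here is a minimal sketch of how the template can be applied through the tokenizer with `tokenizer.apply_chat_template`; the model id and the sample question are illustrative assumptions, not the exact ones from my run:

```python
# Minimal sketch of applying the Llama-3 chat template through the tokenizer.
# The model id and the question below are placeholders for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

question = "Natalia sold clips to 48 of her friends in April, ..."  # placeholder GSM8K-style question

messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string instead of token ids
    add_generation_prompt=True,  # append the assistant header so the model starts answering
)
print(prompt)  # now wrapped in <|begin_of_text|>, <|start_header_id|>... special tokens
```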

Output of vllm in lm-eval: (screenshot)

Output of vllm using the llama chat template, without a system prompt: (screenshot)

Output of vllm using the llama chat template, with the system prompt "you are a helpful assistant": (screenshot)
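
In case it helps, this is roughly how the two variants above differ (a sketch under the same illustrative assumptions as before):

```python
# Sketch of the two prompt variants: without vs. with a system prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
question = "Natalia sold clips to 48 of her friends in April, ..."  # placeholder GSM8K-style question

without_system = [{"role": "user", "content": question}]
with_system = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": question},
]

# Both variants are rendered with the same Llama-3 chat template.
for messages in (without_system, with_system):
    print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
    print("-" * 40)
```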

In the end, the accuracy of vllm with the chat template drops significantly! (screenshot)

Do you have any idea about this?

I think people should use chat templates more in evaluation, since that is closer to real-world scenarios.

shiweijiezero · May 14 '24, 13:05