lm-evaluation-harness

Inconsistent evaluation results with Chat Template

Open shiweijiezero opened this issue 9 months ago • 4 comments

I evaluated llama3-8b-Instruct on the gsm8k benchmark and found some interesting phenomena.

  1. Huggingface and vllm give similar results. (screenshot)

  2. If I start an API service with vllm and use lm-eval's local-chat mode to evaluate gsm8k, the resulting accuracy is different. (screenshot)

  3. I browsed the source code of lm-eval and found that the API mode applies the chat template, while inference in the huggingface and vllm modes does not use a chat template (i.e. there are no special tokens).

  4. I tried to apply the chat template via the tokenizer (see the sketch below) and logged some intermediate results; could you give me some insight into this?
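
For reference, here is a minimal sketch of how the template can be applied through the tokenizer with `tokenizer.apply_chat_template`; the model id and the sample question are illustrative assumptions, not the exact ones from my run:

```python
# Minimal sketch of applying the Llama-3 chat template through the tokenizer.
# The model id and the question below are placeholders for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

question = "Natalia sold clips to 48 of her friends in April, ..."  # placeholder GSM8K-style question

messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string instead of token ids
    add_generation_prompt=True,  # append the assistant header so the model starts answering
)
print(prompt)  # now wrapped in <|begin_of_text|>, <|start_header_id|>... special tokens
```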

Output of vllm in lm-eval: (screenshot)

Output of vllm using the llama chat template, without a system prompt: (screenshot)

Output of vllm using the llama chat template, with the system prompt "you are a helpful assistant": (screenshot)
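
In case it helps, this is roughly how the two variants above differ (a sketch under the same illustrative assumptions as before):

```python
# Sketch of the two prompt variants: without vs. with a system prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
question = "Natalia sold clips to 48 of her friends in April, ..."  # placeholder GSM8K-style question

without_system = [{"role": "user", "content": question}]
with_system = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": question},
]

# Both variants are rendered with the same Llama-3 chat template.
for messages in (without_system, with_system):
    print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
    print("-" * 40)
```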

In the end, the accuracy of vllm with the chat template drops significantly! (screenshot)

Do you have any idea about this?

I think people should use chat templates more in evaluation, since that is closer to real-world scenarios.

shiweijiezero · May 14 '24, 13:05