LooGLE
Prompt format for different models
Hi! I have read the code for open-source model evaluation. I noticed that, unlike some existing benchmarks such as LongBench or L-Eval, there is no prompt customization for different models (e.g., the prompt format of the Vicuna series differs from the original LLaMA-2). For a fair comparison, do you think such customization should be added to the code?
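For illustration, here is a minimal sketch of what I mean (the helper names are hypothetical, not from the LooGLE codebase), using the widely documented LLaMA-2-chat and Vicuna conversation templates:

```python
# Sketch: per-model prompt wrappers, showing why the same raw prompt
# is not equivalent across chat-tuned models. Function names are made up.

def wrap_llama2_chat(user_prompt: str, system_prompt: str = "") -> str:
    # LLaMA-2-chat expects the [INST] ... [/INST] template,
    # with an optional <<SYS>> block for the system prompt.
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    return f"[INST] {sys_block}{user_prompt} [/INST]"

def wrap_vicuna(user_prompt: str) -> str:
    # Vicuna (v1.1+) uses a USER:/ASSISTANT: conversation template
    # preceded by a fixed preamble.
    preamble = ("A chat between a curious user and an artificial intelligence "
                "assistant. The assistant gives helpful, detailed, and polite "
                "answers to the user's questions.")
    return f"{preamble} USER: {user_prompt} ASSISTANT:"

PROMPT_WRAPPERS = {
    "llama-2-chat": wrap_llama2_chat,
    "vicuna": wrap_vicuna,
}

def build_prompt(model_family: str, task_prompt: str) -> str:
    # Fall back to the raw prompt for base (non-chat) models.
    return PROMPT_WRAPPERS.get(model_family, lambda p: p)(task_prompt)
```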
Hi, we agree that carefully customized prompts for different tasks help: they can unlock the potential of models through more standardized output formats and thus yield better performance during assessment.
As far as we know, LongBench designed different instructions for different datasets/tasks rather than for different models. In our case, we selected the most popular and common NLP tasks (summarization, QA) for evaluation. These tasks impose no strict requirements on the output format, while for cloze tasks we do design prompts adaptively for a fair comparison.
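A rough sketch of the task-keyed approach described above (the dictionary and instruction strings are illustrative assumptions, not the actual LooGLE or LongBench templates):

```python
# Sketch: instructions keyed by task/dataset rather than by model.
TASK_INSTRUCTIONS = {
    "summarization": "Please summarize the following document.\n\n{context}\n\nSummary:",
    "qa": "Read the document and answer the question.\n\n{context}\n\nQuestion: {question}\nAnswer:",
    "cloze": "Fill in each <mask> in the passage with the correct entity.\n\n{context}\n\nAnswers:",
}

def task_prompt(task: str, **fields: str) -> str:
    # Every model receives the same instruction for a given task.
    return TASK_INSTRUCTIONS[task].format(**fields)
```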
Thanks for the clarification!