LongBench
Evaluation mechanism update
Currently, many evaluations of long-context models reference LongBench results. However, n-gram-based metrics do not truly reflect response quality, and many papers have instead adopted GPT-4o-based scoring. Could you provide an official version of the GPT-4o scoring code, so that 4o scoring is standardized across evaluations and results are more comparable? A rough sketch of the kind of scoring flow I have in mind is below.
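For context, this is a minimal sketch of what such LLM-as-a-judge scoring might look like, assuming the official `openai` Python SDK (>=1.0) and an `OPENAI_API_KEY` in the environment; the judge prompt, the 1-10 scale, and the helper name `gpt4o_score` are illustrative placeholders, not an official rubric:

```python
import re
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

# Illustrative judge prompt; the official rubric may differ.
JUDGE_PROMPT = """You are evaluating an answer to a long-context question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate the model answer from 1 (completely wrong) to 10 (fully correct and complete).
Reply with only the integer score."""

def gpt4o_score(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4o to rate one prediction against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging for comparability
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    text = response.choices[0].message.content
    match = re.search(r"\d+", text)  # pull the integer score out of the reply
    return int(match.group()) if match else 0

# Usage: average the judge score over a list of samples, e.g.
# samples = [{"question": ..., "reference": ..., "prediction": ...}, ...]
# avg = sum(gpt4o_score(**s) for s in samples) / len(samples)
```

An official version would mainly need to fix the prompt, the scale, and the aggregation so numbers from different papers line up.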
Great suggestion! I will update the code to support LLM-as-a-judge evaluation in the next few days.