LongBench
Evaluation mechanism update
Currently, many evaluations of long-context models reference LongBench results. However, n-gram-based metrics do not truly reflect response quality, and many papers have instead adopted GPT-4o-based scoring. Could you provide an official version of the GPT-4o scoring code, so that 4o scoring is standardized across evaluations and results are more comparable? A rough sketch of the kind of scoring flow I have in mind is below.
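For context, this is a minimal sketch of what such LLM-as-a-judge scoring might look like, assuming the official `openai` Python SDK (>=1.0) and an `OPENAI_API_KEY` in the environment; the judge prompt, the 1-10 scale, and the helper name `gpt4o_score` are illustrative placeholders, not an official rubric:

```python
import re
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

# Illustrative judge prompt; the official rubric may differ.
JUDGE_PROMPT = """You are evaluating an answer to a long-context question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate the model answer from 1 (completely wrong) to 10 (fully correct and complete).
Reply with only the integer score."""

def gpt4o_score(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4o to rate one prediction against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging for comparability
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    text = response.choices[0].message.content
    match = re.search(r"\d+", text)  # pull the integer score out of the reply
    return int(match.group()) if match else 0

# Usage: average the judge score over a list of samples, e.g.
# samples = [{"question": ..., "reference": ..., "prediction": ...}, ...]
# avg = sum(gpt4o_score(**s) for s in samples) / len(samples)
```

An official version would mainly need to fix the prompt, the scale, and the aggregation so numbers from different papers line up.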
Great suggestion! I will update the code to support LLM-as-a-judge evaluation in the next few days.