
[Feature Request] Implement G-EVAL Inspired Weighted Sum Scoring for LLM Grading

Open MotzWanted opened this issue 1 year ago • 0 comments

Potential concern:

In the self-grading example, you utilise LLMs to judge text quality using a form-based method. I've spotted two problems with this scoring approach:

  1. One value (e.g. 3 on a 1-5 scale) tends to dominate the score distribution. This compresses the scores so much that they correlate poorly with human judgements.
  2. Even when the prompt asks for decimals, LLMs usually return whole-number scores. This produces many identical scores and masks subtle quality differences between texts.

Feature Description:

Integrate a G-EVAL inspired grading system that addresses these issues by using the output-token probabilities from the LLM to normalise the candidate scores and taking their weighted sum as the final score.
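A minimal sketch of the idea, assuming the grader can return log-probabilities for the candidate rating tokens ("1" through "5") at the position where the score is generated. The function name `weighted_score` and the example logprob values are hypothetical, not part of promptfoo's API:

```python
import math

def weighted_score(token_logprobs: dict[str, float]) -> float:
    """G-EVAL style scoring: softmax-normalise the logprobs of the
    candidate rating tokens, then return the probability-weighted
    sum of the ratings, i.e. score = sum_i p(s_i) * s_i."""
    # Numerically stable softmax over the rating tokens only.
    max_lp = max(token_logprobs.values())
    exps = {tok: math.exp(lp - max_lp) for tok, lp in token_logprobs.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Weighted sum over the integer ratings.
    return sum(int(tok) * p for tok, p in probs.items())

# Hypothetical logprobs for the rating token "1".."5" on a 1-5 scale.
logprobs = {"1": -4.2, "2": -2.1, "3": -0.3, "4": -1.6, "5": -3.8}
score = weighted_score(logprobs)  # fractional score between 1 and 5
```

Because the result blends the probability mass of neighbouring ratings, two texts that would both be graded "3" under form-filling can receive distinct fractional scores such as 3.09 vs 3.41, recovering the fine-grained differences described above.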

Expected Benefits:

  • Improved Correlation: As observed in the G-EVAL paper, this approach has a higher correlation with human evaluations compared to other NLG evaluators.
  • Refined Grading: Using the probabilities of output rating tokens ensures a more nuanced and detailed grading system.

MotzWanted · Sep 20 '23 06:09