geval
More benchmarks and prompt clarification
Hi team,
Thank you so much for this work; I find it interesting and inspiring. Do you plan to release the prompts and results for the two other benchmarks reported in your paper as well?
Also, I noticed that in your prompts only fluency is scored on a 1-3 scale, while all other aspects use 1-5, and that fluency has a detailed rubric for each level while the others do not. Is there a reason for this difference?
Finally, you leave the examples empty. Does that mean that providing some examples to GPT-4 would help the model better align its evaluation standard with the human-level standard we want?
Thank you!