Add more deterministic/math-based assertions
Is your feature request related to a problem? Please describe.
Common math-based evaluation metrics for NLP/LLM outputs include ROUGE (already supported), BLEU, METEOR, GLEU, and others.
See https://github.com/Aldenhovel/bleu-rouge-meteor-cider-spice-eval4imagecaption and https://huggingface.co/spaces/evaluate-metric/google_bleu for details and examples.
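Of the requested metrics, GLEU (the "Google BLEU" variant documented on the Hugging Face page linked above) is the simplest to state. It collects every n-gram of orders 1 through 4 from the output and from the reference, counts matching n-grams (clipped by how often each appears in the reference), and takes the minimum of precision and recall. The TypeScript sketch below assumes a single reference, whitespace tokenization, and no lowercasing or smoothing. `sentenceGleu` and `ngramCounts` are hypothetical names, not promptfoo functions, and library implementations such as NLTK's `sentence_gleu` may differ in tokenization and multi-reference handling.

```typescript
// Hypothetical sketch of sentence-level GLEU ("Google BLEU") with one reference.
// Assumptions: whitespace tokenization, n-gram orders minN..maxN (default 1..4), no smoothing.

// Count every n-gram of orders minN..maxN. Tokens never contain whitespace,
// so joining with a single space yields unambiguous keys.
function ngramCounts(tokens: string[], minN: number, maxN: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      const key = tokens.slice(i, i + n).join(" ");
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Total number of n-grams (with multiplicity) in a count map.
function totalCount(counts: Map<string, number>): number {
  let sum = 0;
  counts.forEach((c) => {
    sum += c;
  });
  return sum;
}

function sentenceGleu(output: string, reference: string, minN = 1, maxN = 4): number {
  const tokenize = (s: string) => s.trim().split(/\s+/).filter((t) => t.length > 0);
  const hyp = ngramCounts(tokenize(output), minN, maxN);
  const ref = ngramCounts(tokenize(reference), minN, maxN);

  // Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
  let matches = 0;
  hyp.forEach((count, gram) => {
    matches += Math.min(count, ref.get(gram) ?? 0);
  });

  // min(precision, recall) equals matches / max(hypTotal, refTotal),
  // because both ratios share the same numerator.
  const denom = Math.max(totalCount(hyp), totalCount(ref));
  return denom === 0 ? 0 : matches / denom;
}

console.log(sentenceGleu("the cat", "the cat sat"));
```

Here `sentenceGleu("the cat", "the cat sat")` returns 0.5: the output's 3 n-grams ("the", "cat", "the cat") all match, but the reference has 6 n-grams, so recall (3/6) is the binding term.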
Describe the solution you'd like
I'd like these common evaluation metrics to be available as assertions in promptfoo.
Describe alternatives you've considered
Implementing them as custom assertions. However, I believe it would benefit all promptfoo users to have these assertions built in.
@mldangelo Hi, I'd like to tackle this issue. Here's how I plan to go about it:
- define meteor and gleu in the enum here: https://github.com/promptfoo/promptfoo/blob/main/src/types/index.ts#L387
- register a corresponding handler here: https://github.com/promptfoo/promptfoo/blob/main/src/assertions/index.ts#L231
- define the handlers in their own files, similar to rouge.ts
Does this sound good? Let me know if you see any concerns.
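The three-step plan above (enum entry, registry entry, per-metric handler file) can be sketched as a self-contained TypeScript illustration. Everything here is hypothetical: `SketchResult`, `SketchHandler`, `runMetricAssertion`, the `unigram-f1-sketch` assertion name, and the clipped unigram-F1 stand-in metric are invented for this example and are not promptfoo's actual types, registry, or metrics. The optional `not-` prefix handling is likewise an assumption of the sketch.

```typescript
// Hypothetical sketch of the enum + registry + handler pattern.
// None of these names are promptfoo's real types; they are stand-ins.

interface SketchResult {
  pass: boolean;
  score: number;
  reason: string;
}

type SketchHandler = (output: string, reference: string, threshold: number) => SketchResult;

// Stand-in metric (clipped unigram F1), used only to make the sketch runnable.
// It is NOT GLEU or METEOR.
function unigramF1(output: string, reference: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const hyp = tokenize(output);
  const ref = tokenize(reference);
  const remaining = new Map<string, number>();
  ref.forEach((t) => remaining.set(t, (remaining.get(t) ?? 0) + 1));
  let matches = 0;
  hyp.forEach((t) => {
    const left = remaining.get(t) ?? 0;
    if (left > 0) {
      matches++;
      remaining.set(t, left - 1);
    }
  });
  if (matches === 0) return 0;
  const precision = matches / hyp.length;
  const recall = matches / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// Step 1 (enum): the assertion names this sketch knows about.
type MetricAssertionType = "unigram-f1-sketch";

// Step 3 (handler file): one function per metric, returning pass/score/reason.
const handleUnigramF1: SketchHandler = (output, reference, threshold) => {
  const score = unigramF1(output, reference);
  const pass = score >= threshold;
  const cmp = pass ? ">=" : "<";
  return { pass, score, reason: `Unigram F1 ${score.toFixed(3)} ${cmp} threshold ${threshold}` };
};

// Step 2 (registry): map each assertion name to its handler.
const handlers: Record<MetricAssertionType, SketchHandler> = {
  "unigram-f1-sketch": handleUnigramF1,
};

// Dispatch by name; an optional "not-" prefix inverts pass/fail.
function runMetricAssertion(type: string, output: string, reference: string, threshold: number): SketchResult {
  const inverse = type.startsWith("not-");
  const baseType = (inverse ? type.slice(4) : type) as MetricAssertionType;
  const handler = handlers[baseType];
  if (!handler) {
    throw new Error(`Unknown assertion type: ${type}`);
  }
  const result = handler(output, reference, threshold);
  return inverse ? { ...result, pass: !result.pass } : result;
}
```

Keeping each metric behind the same handler signature means adding a new metric only touches the name union, the registry object, and one new handler function, which mirrors the division of work described in the plan.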
Awesome @adityabharadwaj198!
I just opened a PR with guidance on adding a new assertion: https://github.com/promptfoo/promptfoo/pull/3610 Please feel free to leave comments on the PR if you think parts of it can be improved.
You can reference https://github.com/promptfoo/promptfoo/pull/3605, https://github.com/promptfoo/promptfoo/pull/2469, and https://github.com/promptfoo/promptfoo/pull/2081 as recent assertion PRs.
Good luck! And when you're done, send me an email at michael @ promptfoo.dev and I'll send you some swag.
Thanks @mldangelo!
@mldangelo I opened a PR that adds the METEOR score: https://github.com/promptfoo/promptfoo/pull/3776. I'd love to hear your thoughts on it!