promptfoo Add more deterministic/math-based assertions

Is your feature request related to a problem? Please describe. Common math-based evaluation metrics for NLP/LLM outputs include ROUGE (already supported), BLEU, METEOR, GLEU, and others.

See https://github.com/Aldenhovel/bleu-rouge-meteor-cider-spice-eval4imagecaption and https://huggingface.co/spaces/evaluate-metric/google_bleu for details and examples.

Describe the solution you'd like I'd like these common evaluation metrics to be available as assertions in promptfoo.

Describe alternatives you've considered Use a custom assertion to implement them. I believe it would be beneficial to all promptfoo users to have such assertions built-in.

Sep 23 '24 06:09 sinedied

@mldangelo Hi, I was looking at tackling this issue. Here's how I'm planning to go about this:

1. Define meteor and gleu in the enum here: https://github.com/promptfoo/promptfoo/blob/main/src/types/index.ts#L387
2. Register a corresponding handler here: https://github.com/promptfoo/promptfoo/blob/main/src/assertions/index.ts#L231
3. Define the handlers in their own files, similar to rouge.ts
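
To make the plan concrete, here is a rough, self-contained sketch of what a GLEU handler could compute. It does not use promptfoo's actual types or handler signature: SketchResult, ngramCounts, gleuScore, and handleGleuSketch are hypothetical names, and the whitespace tokenization, lowercasing, and 0.5 default threshold are assumptions for illustration only. The score follows the usual Google BLEU definition: matching n-grams (orders 1 to 4) divided by the larger of the candidate and reference n-gram totals, which is the minimum of n-gram precision and recall.

```typescript
// Hypothetical result shape for this sketch; promptfoo's real result type may differ.
interface SketchResult {
  pass: boolean;
  score: number;
  reason: string;
}

// Count every n-gram of order minN..maxN in a token list.
function ngramCounts(tokens: string[], minN = 1, maxN = 4): Map<string, number> {
  const counts = new Map<string, number>();
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      const gram = tokens.slice(i, i + n).join(' ');
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return counts;
}

// Sentence-level GLEU: clipped n-gram overlap divided by the larger n-gram total.
// Assumption: naive lowercase + whitespace tokenization.
function gleuScore(candidate: string, reference: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const cand = ngramCounts(tokenize(candidate));
  const ref = ngramCounts(tokenize(reference));

  let candTotal = 0;
  let refTotal = 0;
  let overlap = 0;
  cand.forEach((count, gram) => {
    candTotal += count;
    // Clip each match by how often the n-gram appears in the reference.
    overlap += Math.min(count, ref.get(gram) ?? 0);
  });
  ref.forEach((count) => {
    refTotal += count;
  });

  const denom = Math.max(candTotal, refTotal);
  return denom === 0 ? 0 : overlap / denom;
}

// Hypothetical handler: passes when the score reaches the threshold.
function handleGleuSketch(output: string, expected: string, threshold = 0.5): SketchResult {
  const score = gleuScore(output, expected);
  const pass = score >= threshold;
  return {
    pass,
    score,
    reason: pass
      ? 'GLEU score meets the threshold'
      : `GLEU score ${score.toFixed(4)} is below threshold ${threshold}`,
  };
}
```

For example, handleGleuSketch('the cat', 'the cat sat') matches 3 of the reference's 6 n-grams, so it scores 0.5 and passes at the default threshold. A real handler would need to plug into the existing assertion plumbing the same way rouge.ts does.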

Does this sound good? Let me know if you have any concerns.

Apr 07 '25 21:04 adityabharadwaj198

Awesome @adityabharadwaj198!

I just opened a PR with guidance on adding a new assertion: https://github.com/promptfoo/promptfoo/pull/3610 Please feel free to leave comments on the PR if you think parts of it can be improved.

You can reference https://github.com/promptfoo/promptfoo/pull/3605, https://github.com/promptfoo/promptfoo/pull/2469, and https://github.com/promptfoo/promptfoo/pull/2081 as recent assertion PRs.

Good luck! Send me an email at michael @ promptfoo.dev when you're done and I'll send you some swag.

Apr 07 '25 23:04 mldangelo

Thanks, @mldangelo!

Apr 08 '25 02:04 adityabharadwaj198

@mldangelo I opened a PR that adds a METEOR score assertion: https://github.com/promptfoo/promptfoo/pull/3776. Would love to hear your thoughts on it!

Apr 23 '25 14:04 adityabharadwaj198