promptfoo context-recall prompt not returning expected answer format.

Describe the bug

The context-recall prompt is expected by matchesContextRecall to produce a list of single-sentence statements, one per line, each followed by an [Attributed] or [Not Attributed] marker, as illustrated in the single-shot example within the prompt:

1. Albert Einstein born in 14 March 1879 was  German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. The date of birth of Einstein is mentioned clearly in the context. So [Attributed]
2. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics. The exact sentence is present in the given context. So [Attributed]
3. He published 4 papers in 1905. There is no mention about papers he wrote in given the context. So [Not Attributed]
4. Einstein moved to Switzerland in 1895. There is not supporting evidence for this in the given the context. So [Not Attributed]

The format is essential, since the number of [Attributed] occurrences is divided by the number of lines/statements to get the score.
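
As a rough illustration of that scoring rule, here is a minimal Python sketch. It is not promptfoo's actual code: the function name `naive_recall_score` is hypothetical, and it assumes every line of the response is treated as one statement. The example sentences are shortened from the one-shot example above.

```python
def naive_recall_score(llm_output: str) -> float:
    """Fraction of lines that contain the literal marker "[Attributed]".

    Assumption: every line of the output counts as one statement,
    including blank lines and any preamble the model adds.
    """
    lines = llm_output.split("\n")
    attributed = sum(1 for line in lines if "[Attributed]" in line)
    return attributed / len(lines)


# Shortened one-shot example: 2 of the 4 statements are attributed.
example = "\n".join([
    "1. Albert Einstein was born on 14 March 1879. So [Attributed]",
    "2. He received the 1921 Nobel Prize in Physics. So [Attributed]",
    "3. He published 4 papers in 1905. So [Not Attributed]",
    "4. Einstein moved to Switzerland in 1895. So [Not Attributed]",
])
print(naive_recall_score(example))  # 0.5
```

With this rule, any line that is not a numbered statement (an introduction, a blank line, a repeated context) enlarges the denominator and lowers the score.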

However, in my experiments with GPT-4 and GPT-4o this is not the case. This leads me to a follow-up question: are these prompts themselves evaluated by promptfoo in the project?

To Reproduce

context:

Request a signature from the service owner on the DR plan. The Service Owner of FOO will be responsible for reviewing and approving the DRP. They will sign the DRP electronically to ensure that they have reviewed and approved the plan

ground truth:

The Signature from the service owner on the DR plan is requested

output with debug

"message":{"content":"Let's analyze the answer sentence by sentence and classify each sentence based on whether it can be attributed to the given context.\n\ncontext: Request a signature from the service owner on the DR plan. The Service Owner of FOO will be responsible for reviewing and approving the DRP. They will sign the DRP electronically to ensure that they have reviewed and approved the plan.\n\nanswer: The Signature from the service owner on the DR plan is requested.\n\nclassification:\n1. The Signature from the service owner on the DR plan is requested. The context clearly states that a signature from the service owner on the DR plan is requested. So [Attributed]","role":"assistant"}

Result: Failure. Recall 0.13 is < 0.5

Expected behavior

The prompt produces stable output. A contrived test like the one above should give a score of 1. Maybe a JSON output would be a better choice, but this might be trickier for smaller LLMs.
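
The reported 0.13 is consistent with a scorer that counts every line of the response, blank lines and preamble included: the output above splits into 8 lines on `\n`, only one of which contains `[Attributed]`, and 1/8 = 0.125. The Python sketch below reproduces that arithmetic and contrasts it with a more tolerant variant that counts only numbered statement lines. Both functions are hypothetical illustrations under that assumption, not promptfoo's implementation, and the long sentences in the response are abbreviated while the newline structure is kept.

```python
import re

# Response content from the debug output above. Long sentences are
# abbreviated with "..."; the newline structure matches the original.
content = (
    "Let's analyze the answer sentence by sentence ...\n\n"
    "context: Request a signature from the service owner on the DR plan. ...\n\n"
    "answer: The Signature from the service owner on the DR plan is requested.\n\n"
    "classification:\n"
    "1. The Signature from the service owner on the DR plan is requested. "
    "The context clearly states that ... So [Attributed]"
)


def score_all_lines(text: str) -> float:
    """Assumed current behaviour: every line counts as a statement."""
    lines = text.split("\n")
    return sum("[Attributed]" in line for line in lines) / len(lines)


def score_numbered_lines(text: str) -> float:
    """Hypothetical tolerant variant: only lines like "1. ..." count."""
    statements = [l for l in text.split("\n") if re.match(r"\s*\d+\.\s", l)]
    if not statements:
        return 0.0
    return sum("[Attributed]" in s for s in statements) / len(statements)


print(score_all_lines(content))       # 0.125 (1 attributed line out of 8)
print(score_numbered_lines(content))  # 1.0
```

Filtering for numbered lines is only one possible mitigation; a structured (for example JSON) response, as suggested above, would avoid line-based parsing altogether.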

System information (please complete the following information):

Promptfoo version: 0.79.0
LLMs: gpt-4-1106-preview, gpt-4o-2024-05-13

Aug 22 '24 08:08 AlexRRR

Thanks for flagging. This is a RAGAS metric, and our implementation was copied from theirs. I think they've since refined it, and it probably makes sense to re-implement the metrics according to what the very talented people at Ragas have done since then. I bet that would reduce the likelihood of this problem.

Aug 30 '24 04:08 typpo

Hello, I am facing a similar issue. From this issue description, it seems the problem is just the output format? How can that be? Like @AlexRRR mentioned, I am running a contrived test that should definitely get a score of 1.

Jan 13 '25 17:01 GDistel