FactualCorrectness false negative calculation
Describe the bug
The calculation of the answer_correctness metric (0.75 * factual_correctness + 0.25 * semantic_similarity) does not use the same implementation of factual_correctness as when FactualCorrectness is computed directly. This discrepancy leads to inconsistent results in which answer_correctness can be lower than both factual_correctness and semantic_similarity, which should be impossible for a weighted average of the two scores.
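To make the inconsistency concrete, here is a minimal numeric sketch (the values are made up) showing that a 0.75 / 0.25 weighted average always lies between its two inputs; if the reported answer_correctness falls below both, the factual_correctness term it used must differ from the one returned by FactualCorrectness directly:

```python
# Made-up scores to illustrate the bound on a weighted average.
factual_correctness = 0.60
semantic_similarity = 0.80

answer_correctness = 0.75 * factual_correctness + 0.25 * semantic_similarity
print(answer_correctness)  # 0.65, between 0.60 and 0.80

# A weighted average can never be lower (or higher) than both inputs.
assert min(factual_correctness, semantic_similarity) <= answer_correctness
assert answer_correctness <= max(factual_correctness, semantic_similarity)
```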
Code to Reproduce
score = evaluate(dataset, metrics=[answer_correctness, answer_similarity, FactualCorrectness()], llm=self.azure_model, embeddings=self.azure_embeddings, run_config=self.run_config)
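For completeness, a fuller version of the reproduction as a sketch; the import paths are assumptions about the installed ragas version, and `dataset`, `azure_model`, `azure_embeddings` and `run_config` are placeholders for the objects used in the snippet above:

```python
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_similarity, FactualCorrectness

# Placeholders: dataset, azure_model, azure_embeddings and run_config are
# assumed to be set up as in the original report.
score = evaluate(
    dataset,
    metrics=[answer_correctness, answer_similarity, FactualCorrectness()],
    llm=azure_model,
    embeddings=azure_embeddings,
    run_config=run_config,
)
print(score)
```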
Error trace
No error given.
Expected behavior
I would expect factual correctness to be the same inside answer_correctness and when calculated with FactualCorrectness directly.
How is fp = fn? fp is the sum of ~reference_response and fn is the sum of ~response_reference. They are sums of negations of different arrays.
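For illustration, a small sketch of what this comment describes, assuming the verdicts are boolean arrays over two different sets of claims (the array names mirror the comment; the values are made up):

```python
import numpy as np

# Illustrative verdict arrays: each entry is the verdict for one claim,
# and the two arrays cover different claim sets, so lengths and values differ.
reference_response = np.array([True, True, False])
response_reference = np.array([True, False, False, False])

tp = int(reference_response.sum())
fp = int((~reference_response).sum())   # negation of reference_response
fn = int((~response_reference).sum())   # negation of response_reference (a different array)

print(tp, fp, fn)  # 2 1 3 -> fp and fn are generally not equal
```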
Yes, that is true. I have edited the issue. Anyway, in some cases I get a different value from FactualCorrectness and from the factual correctness calculated inside answer_correctness.
I can confirm the same situation. The reason seems to be that the factual_correctness used in answer_correctness has a different prompt compared to the one used by FactualCorrectness alone. It would be really nice to also have this version of factual correctness available.
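Until that is available, one workaround is to combine the standalone metrics manually so both numbers come from the same FactualCorrectness implementation. A minimal sketch, assuming the ragas SemanticSimilarity metric and that the per-sample result columns carry the metrics' default names (both are assumptions about the installed ragas version):

```python
# Sketch: combine FactualCorrectness with SemanticSimilarity using the same
# 0.75 / 0.25 weights that answer_correctness applies, so the factual part is
# identical to the standalone FactualCorrectness score. `dataset`, `azure_model`
# and `azure_embeddings` are placeholders for the objects from the report above.
from ragas import evaluate
from ragas.metrics import FactualCorrectness, SemanticSimilarity

result = evaluate(
    dataset,
    metrics=[FactualCorrectness(), SemanticSimilarity()],
    llm=azure_model,
    embeddings=azure_embeddings,
)

df = result.to_pandas()
# Column names are assumed to match the metrics' default names.
df["combined_correctness"] = (
    0.75 * df["factual_correctness"] + 0.25 * df["semantic_similarity"]
)
print(df[["factual_correctness", "semantic_similarity", "combined_correctness"]])
```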