Quantifying factual correctness with multiple modes at the same time

Razvanip13 opened this issue.

Describe the Feature

Rather than choosing a single mode for quantifying factual correctness, a user should be able to request multiple modes at the same time.

Why is the feature important for you?

Each mode (f1_score, precision, recall) interprets correctness from a different angle. To get a clear overview of answer correctness, I think a user should look at all of the results. At the same time, calling factual correctness once per mode is redundant, because every mode is computed from the same statements processed by the LLM. Supporting multiple modes in a single call would therefore save computation time and API costs.
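A minimal sketch of the idea, assuming the LLM step has already classified the decomposed statements into true positives (response claims supported by the reference), false positives (response claims not supported) and false negatives (reference claims missing from the response). Since all three modes follow from the same counts, they can be returned together from one pass. `StatementCounts` and `score_all_modes` are hypothetical names, not an existing library API, and returning 0.0 on a zero denominator is an assumed convention.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class StatementCounts:
    # Hypothetical container for counts obtained once from the LLM-processed statements.
    tp: int  # response claims supported by the reference
    fp: int  # response claims not supported by the reference
    fn: int  # reference claims missing from the response


def _safe_div(num: float, den: float) -> float:
    # Assumed convention: a zero denominator yields a score of 0.0.
    return num / den if den else 0.0


def score_all_modes(
    counts: StatementCounts,
    modes: Iterable[str] = ("precision", "recall", "f1_score"),
) -> Dict[str, float]:
    """Compute every requested mode from a single set of counts (hypothetical helper)."""
    requested = tuple(modes)  # materialize so a generator is not consumed twice
    precision = _safe_div(counts.tp, counts.tp + counts.fp)
    recall = _safe_div(counts.tp, counts.tp + counts.fn)
    # F1 is the harmonic mean of precision and recall.
    f1_score = _safe_div(2 * precision * recall, precision + recall)
    available = {"precision": precision, "recall": recall, "f1_score": f1_score}
    unknown = set(requested) - set(available)
    if unknown:
        raise ValueError(f"unknown mode(s): {sorted(unknown)}")
    return {mode: available[mode] for mode in requested}


if __name__ == "__main__":
    print(score_all_modes(StatementCounts(tp=3, fp=1, fn=2)))
```

With `tp=3, fp=1, fn=2`, precision is 3/4 and recall is 3/5, and F1 comes out to 2/3; a caller asking only for `("recall",)` gets just that key, without any extra LLM calls.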

Additional context

Papers such as https://arxiv.org/abs/2307.16877 and https://arxiv.org/pdf/2503.16161 discuss the trade-offs between the modes.

If the proposal is approved, I would like to be the one who implements it.

Jun 04 '25 16:06 Razvanip13