
Handle errors separately in evaluators and the run results

Open · mrm1001 opened this issue 7 months ago · 1 comment

Context

When running the evaluators over larger datasets, depending on the model, it is very common to hit LLM errors where the output is not valid JSON. For example, while running the benchmark scripts over the ARAGOG dataset, I always have one row that returns invalid JSON, so every time I run the script I get a score report that is not very useful, such as:

| metrics | score |
| --- | --- |
| context_relevance | NaN |

In that case, whenever there is an error, the output of the LLM-based evaluation metric is something like: `{'statements': [], 'statement_scores': [], 'score': nan}`

As a user, I would like to keep track of the errors that happened during evaluation, so ideally this should be surfaced as a flag, for example: `{'statements': [], 'statement_scores': [], 'score': nan, 'error': True}`
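
A minimal sketch of what this could look like on the evaluator side, assuming the evaluator parses each LLM reply with `json.loads`; the function name `parse_llm_reply` and the handling details are illustrative, not Haystack's actual implementation:

```python
import json
import math

def parse_llm_reply(raw_reply: str) -> dict:
    """Turn one LLM reply into a per-row result, flagging invalid JSON instead of failing."""
    try:
        parsed = json.loads(raw_reply)
        return {
            "statements": parsed.get("statements", []),
            "statement_scores": parsed.get("statement_scores", []),
            "score": parsed.get("score", math.nan),
            "error": False,
        }
    except json.JSONDecodeError:
        # The model produced invalid JSON: keep the row, but mark it as errored.
        return {"statements": [], "statement_scores": [], "score": math.nan, "error": True}
```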

Then, in the evaluation score report, we could return the mean of the scores, ignoring the errored rows, together with the number of errors:

| metrics | score | total_errors |
| --- | --- | --- |
| context_relevance | 0.9 | 1 |
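
A short sketch of the arithmetic behind that row, assuming each per-row result carries the `error` flag proposed above (the `rows` list is just illustrative data):

```python
import numpy as np

rows = [
    {"score": 0.9, "error": False},
    {"score": np.nan, "error": True},  # the row with invalid JSON
]

scores = [r["score"] for r in rows]
mean_score = np.nanmean(scores)               # 0.9 -- NaN rows are ignored
total_errors = sum(r["error"] for r in rows)  # 1
```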

Outcome

  • Changes to the LLM-based evaluators (context relevance and faithfulness) so that they return an error flag.
  • Changes to the LLM-based evaluators to return a score even if there are rows with `np.nan` (for example, a suggestion is to replace `np.mean` with `np.nanmean` here).
  • Changes to the `score_report()` function of the `EvaluationRunResult` to return total errors (see the sketch below).
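
To illustrate the last two points together, here is a hedged sketch of how a report with a total_errors column could be built. It assumes `score_report()` produces a pandas DataFrame (as the current implementation appears to), while the `results_per_metric` mapping and its shape are purely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical per-row scores collected from each evaluator; NaN marks an errored row.
results_per_metric = {
    "context_relevance": [0.9, np.nan],
    "faithfulness": [1.0, 0.8],
}

report = pd.DataFrame(
    {
        "metrics": list(results_per_metric),
        "score": [float(np.nanmean(scores)) for scores in results_per_metric.values()],
        "total_errors": [int(np.sum(np.isnan(scores))) for scores in results_per_metric.values()],
    }
)
print(report)
```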

mrm1001 · Jul 03 '24