Handle errors separately in evaluators and the run results
Context
When running the evaluators over larger datasets, depending on the model, it is very common to run into LLM errors where the output is not valid JSON. For example, while running the benchmark scripts over the ARAGOG dataset, I always have one row that comes back with invalid JSON, so every time I run the script I get a score report that is not very useful, such as:
| metrics | score |
| --- | --- |
| context_relevance | NaN |
In that case, whenever there is an error, the output of the LLM-based evaluation metric is something like:
```python
{'statements': [], 'statement_scores': [], 'score': nan}
```
As a user, I would like to keep track of the errors that happened during evaluation, so ideally this should be returned as a flag, for example:
```python
{'statements': [], 'statement_scores': [], 'score': nan, 'error': True}
```
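A minimal sketch of how an evaluator could surface that flag, assuming a hypothetical `_parse_llm_output` helper (the parsing code in the actual evaluators may look different):

```python
import json

import numpy as np


def _parse_llm_output(raw_reply: str) -> dict:
    """Hypothetical helper: parse the LLM reply and flag rows with invalid JSON."""
    try:
        parsed = json.loads(raw_reply)
        statement_scores = parsed["statement_scores"]
        score = np.mean(statement_scores) if statement_scores else np.nan
        return {
            "statements": parsed["statements"],
            "statement_scores": statement_scores,
            "score": score,
            "error": False,
        }
    except (json.JSONDecodeError, KeyError):
        # Invalid JSON (or missing keys): keep the row, but mark it as an error
        # instead of only returning NaN.
        return {"statements": [], "statement_scores": [], "score": np.nan, "error": True}
```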
Then, in the evaluation score report, we could return the mean of the scores by ignoring the errors:
| metrics | score | total_errors |
| --- | --- | --- |
| context_relevance | 0.9 | 1 |
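For illustration only (the result shape below is made up for the example), the 0.9 above could be obtained by ignoring the NaN rows with `np.nanmean` and counting the flagged rows separately:

```python
import numpy as np

# Per-row evaluator outputs, including one failed row.
results = [
    {"score": 1.0, "error": False},
    {"score": 0.8, "error": False},
    {"score": np.nan, "error": True},
]

scores = [r["score"] for r in results]
mean_score = np.nanmean(scores)                   # 0.9 -- NaN rows are ignored
total_errors = sum(r["error"] for r in results)   # 1
```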
Outcome
- Changes to the LLM-based evaluators (context relevancy and faithfulness) so they return an error flag.
- Changes to the LLM-based evaluators to return a score even if there are rows with `np.nan` (for example, a suggestion is to change `np.mean` to `np.nanmean` here).
- Changes to the `score_report()` function of the `EvaluationRunResult` to return total errors (a rough sketch follows this list).