Evaluator inputs, outputs, and names should be more consistent to allow drop-in replacement
When users switch between statistical evaluators and model-based evaluators, the input and output names of the components often change. We should make them more consistent.
Output Names
For example, every evaluator should output a dict with at least the following keys: "name", "score", and "individual_scores". The "name" is used in the EvaluationResult as the column name.
{"score": 0.75, "individual_scores": [0.5, 1.0], "name": "exact match"}
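A minimal sketch of what this could look like, assuming the Haystack 2.x @component API; the ExactMatchEvaluator class and its input names are illustrative, not an existing implementation:

```python
from typing import Any, Dict, List

from haystack import component


@component
class ExactMatchEvaluator:
    """Illustrative evaluator that emits the standardized output keys."""

    @component.output_types(name=str, score=float, individual_scores=List[float])
    def run(self, predicted_answers: List[str], answers: List[str]) -> Dict[str, Any]:
        # One score per (prediction, ground truth) pair ...
        individual_scores = [
            1.0 if predicted == truth else 0.0
            for predicted, truth in zip(predicted_answers, answers)
        ]
        # ... plus an aggregate, under the same keys for every evaluator.
        score = sum(individual_scores) / len(individual_scores) if individual_scores else 0.0
        return {"name": "exact match", "score": score, "individual_scores": individual_scores}
```

Calling `run(predicted_answers=["a", "b"], answers=["a", "c"])` would then produce exactly the dict shown above, regardless of whether the evaluator is statistical or model-based.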
Input names
LLM-based evaluators have "responses" as inputs, whereas statistical metrics expect "predicted_answers" and "answers". We should rename the inputs of the evaluation framework integrations as well, or provide a component that converts Haystack-style inputs (Answer, Document) to the integrations' inputs (responses, contexts), as sketched below.
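One possible shape for such a converter; the component name and field names are hypothetical, and it assumes string answers plus retrieved Documents as inputs:

```python
from typing import Dict, List

from haystack import Document, component


@component
class EvaluationInputConverter:
    """Hypothetical adapter from Haystack-style inputs to the names
    expected by the evaluation framework integrations."""

    @component.output_types(responses=List[str], contexts=List[List[str]])
    def run(
        self,
        predicted_answers: List[str],
        documents: List[List[Document]],
    ) -> Dict[str, List]:
        # "responses" is what the LLM-based evaluators call the predicted answers.
        responses = predicted_answers
        # "contexts" is one list of context strings per query, extracted from
        # the content of the retrieved Documents.
        contexts = [[doc.content or "" for doc in docs] for docs in documents]
        return {"responses": responses, "contexts": contexts}
```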
Names of Evaluator components
Consistently putting Answer or Document (or both) as a prefix is impractical. We don't want to rename SASEvaluator to AnswerSASEvaluator, and we don't want to rename FaithfulnessEvaluator to AnswerQueryContextFaithfulnessEvaluator. Should ContextRelevance receive Documents as input instead of a str called context? A possible signature is sketched below.
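If ContextRelevance accepted Documents, callers could wire retriever output straight into the evaluator. A sketch only, with placeholder scoring standing in for the real metric (which would call an LLM):

```python
from typing import List

from haystack import Document, component


@component
class ContextRelevanceEvaluator:
    """Sketch: accept Documents directly instead of raw context strings."""

    @component.output_types(name=str, score=float, individual_scores=List[float])
    def run(self, questions: List[str], documents: List[List[Document]]):
        # Extract the text the underlying metric needs from the Documents.
        contexts = [[doc.content or "" for doc in docs] for docs in documents]
        # Placeholder scoring so the sketch runs; the real evaluator would
        # judge relevance of each context list to its question here.
        individual_scores = [1.0 if ctxs else 0.0 for ctxs in contexts]
        score = sum(individual_scores) / len(individual_scores) if individual_scores else 0.0
        return {"name": "context relevance", "score": score, "individual_scores": individual_scores}
```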
Output types
Datatypes might be inconsistent: some evaluators may return numpy float32 arrays while others return lists of Python floats. Should we normalize to plain Python floats at the output boundary?
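A small helper could enforce this at the output boundary; a sketch, assuming numpy is available where needed:

```python
from typing import Iterable, List

import numpy as np


def to_python_floats(values: Iterable) -> List[float]:
    """Coerce numpy float32 scalars or arrays to a plain list of Python
    floats, so every evaluator's "individual_scores" has the same type."""
    return [float(v) for v in values]


# Both inputs normalize to the same plain-Python representation.
assert to_python_floats(np.array([0.5, 1.0], dtype=np.float32)) == [0.5, 1.0]
assert to_python_floats([0.5, 1.0]) == [0.5, 1.0]
```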