[FR]: A way to compute a summary metric from multiple metrics
Proposal summary
I can evaluate a prompt or an application using multiple criteria:
result = evaluate(
    dataset,
    task,
    scoring_metrics=[
        metric1,
        metric2,
        ...
        metric100,
    ],
)
but I would also like to compress the metrics into one number (or a few numbers), probably as a weighted average. I don't see a way to add such derived metrics in the current API.
Motivation
I have a complex task that I want to evaluate on multiple detailed criteria: for example, "should mention point X", "should mention name Y", "should return at most 1000 characters", "should use correct punctuation", etc. I would like an overall comparison showing whether one prompt or model beats another on a majority of the criteria. My current plan is to fetch the metrics from the API as JSON and build my own dashboard, but that dashboard feels like the wrong place to add computation.
Hi @jkseppan, for such purposes you can implement a custom metric. See https://opik.docs.buildwithfern.com/docs/opik/evaluation/metrics/custom_metric.
Inside its score method, call all of your other metrics and aggregate their results however you want before returning a ScoreResult object.
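For example, a weighted-average wrapper could look like this. This is a minimal sketch, not an official recipe: the WeightedAverage class and its weights parameter are illustrative, and it assumes the BaseMetric and ScoreResult interfaces described in the linked docs.

from typing import List

from opik.evaluation.metrics.base_metric import BaseMetric
from opik.evaluation.metrics.score_result import ScoreResult

class WeightedAverage(BaseMetric):
    # Hypothetical wrapper: scores with several metrics, returns one number.
    def __init__(self, name: str, metrics: List[BaseMetric], weights: List[float]):
        super().__init__(name=name)
        self.metrics = metrics
        self.weights = weights

    def score(self, **kwargs) -> ScoreResult:
        # Run every inner metric on the same inputs.
        results = [metric.score(**kwargs) for metric in self.metrics]
        # Collapse the per-metric scores into one weighted average.
        total = sum(w * r.value for w, r in zip(self.weights, results))
        return ScoreResult(name=self.name, value=total / sum(self.weights))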
@alexkuzmik This has come up a few times. What do you think about adding the concept of an AggregateMetric? In the case above, we could compute the aggregate score from the logged metrics.
@jverre @jkseppan I created an internal ticket; we'll try to tackle it in the near future. It will likely be defined along these lines:
from typing import Callable, List

from opik.evaluation.metrics.base_metric import BaseMetric
from opik.evaluation.metrics.score_result import ScoreResult

class AggregatedMetric(BaseMetric):
    def __init__(
        self,
        name: str,
        metrics: List[BaseMetric],
        aggregator: Callable[[List[ScoreResult]], ScoreResult],
        track: bool = True,
    ):
        ...
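If the proposal lands roughly as sketched above, usage might look something like this. This is hypothetical, based only on the signature shown; the aggregator here is an unweighted mean over the inner metrics' scores.

aggregated = AggregatedMetric(
    name="overall",
    metrics=[metric1, metric2],
    aggregator=lambda results: ScoreResult(
        name="overall",
        value=sum(r.value for r in results) / len(results),
    ),
)

result = evaluate(dataset, task, scoring_metrics=[aggregated])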