[FR]: A way to compute a summary metric from multiple metrics
Proposal summary
I can evaluate a prompt or an application using multiple criteria:
result = evaluate(
    dataset,
    task,
    scoring_metrics=[
        metric1,
        metric2,
        ...
        metric100,
    ],
)
but I would also like to compress the metrics into one number (or a few numbers), probably as a weighted average. I don't see a way to add such derived metrics in the current API.
Motivation
I have a complex task that I want to evaluate on multiple detailed criteria: for example, "should mention point X", "should mention name Y", "should return at most 1000 characters", "should use correct punctuation", etc. I would like an overall comparison showing whether one prompt or model beats another on a majority of the criteria. My current plan is to fetch the metrics from the API as JSON and build my own dashboard, but that dashboard feels like the wrong place to add computation.
Hi @jkseppan, for such purposes you can implement a custom metric. See https://opik.docs.buildwithfern.com/docs/opik/evaluation/metrics/custom_metric.
Inside its score method, call all of your other metrics and aggregate their results however you want before returning a ScoreResult object.
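For example, a weighted-average wrapper could look like this. This is a minimal sketch, not an official recipe: the WeightedAverage class and its weights parameter are illustrative, and it assumes the BaseMetric and ScoreResult interfaces described in the linked docs.

from typing import List

from opik.evaluation.metrics.base_metric import BaseMetric
from opik.evaluation.metrics.score_result import ScoreResult

class WeightedAverage(BaseMetric):
    # Hypothetical wrapper: scores with several metrics, returns one number.
    def __init__(self, name: str, metrics: List[BaseMetric], weights: List[float]):
        super().__init__(name=name)
        self.metrics = metrics
        self.weights = weights

    def score(self, **kwargs) -> ScoreResult:
        # Run every inner metric on the same inputs.
        results = [metric.score(**kwargs) for metric in self.metrics]
        # Collapse the per-metric scores into one weighted average.
        total = sum(w * r.value for w, r in zip(self.weights, results))
        return ScoreResult(name=self.name, value=total / sum(self.weights))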
@alexkuzmik This has come up a few times. What do you think about adding the concept of an AggregateMetric? In the case above, we could compute the aggregate score from the logged metrics.
@jverre @jkseppan I created an internal ticket; we'll try to tackle it in the near future. It will likely be defined along these lines:
from typing import Callable, List

from opik.evaluation.metrics.base_metric import BaseMetric
from opik.evaluation.metrics.score_result import ScoreResult

class AggregatedMetric(BaseMetric):
    def __init__(
        self,
        name: str,
        metrics: List[BaseMetric],
        aggregator: Callable[[List[ScoreResult]], ScoreResult],
        track: bool = True,
    ):
        ...
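If the proposal lands roughly as sketched above, usage might look something like this. This is hypothetical, based only on the signature shown; the aggregator here is an unweighted mean over the inner metrics' scores.

aggregated = AggregatedMetric(
    name="overall",
    metrics=[metric1, metric2],
    aggregator=lambda results: ScoreResult(
        name="overall",
        value=sum(r.value for r in results) / len(results),
    ),
)

result = evaluate(dataset, task, scoring_metrics=[aggregated])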