
Considerations behind a "stateful" metric UX

Open Peilun-Li opened this issue 11 months ago • 9 comments

Is your feature request related to a problem? Please describe. Hi, I was wondering what the considerations were behind choosing a "stateful" metric UX. I'll use the hallucination metric as an example, but this applies to other metrics as well:

metric = HallucinationMetric(threshold=0.5)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

By "stateful" here I mean that the metric object itself stores the state (score, reason, etc.) of the last test_case it was run against. I feel this creates unnecessary coupling between metrics and test cases, and it can lead to side effects. For example, in bulk evaluation we cannot go fully parallel across both the test_case and metric dimensions; instead we have to run test cases one by one for each metric, because otherwise we can't pull out the reason, which is stored on the metric object.
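To make the hazard concrete, here is a minimal sketch (the `TestCase` and `StatefulMetric` classes below are hypothetical stand-ins for illustration, not deepeval's actual classes): each call to `measure` clobbers the state left by the previous call, so results must be copied off the metric object before the next call, which forces serialization per metric.

```python
from dataclasses import dataclass

# Hypothetical minimal stand-ins for illustration only.
@dataclass
class TestCase:
    input: str
    expected_score: float

class StatefulMetric:
    """Stores the last result on the metric object itself."""
    def measure(self, test_case: TestCase) -> None:
        # State is overwritten on every call.
        self.score = test_case.expected_score
        self.reason = f"scored {test_case.input}"

metric = StatefulMetric()
cases = [TestCase("a", 0.9), TestCase("b", 0.2)]

# Sequential use works, but each call clobbers the previous result,
# so score/reason must be copied out before the next measure() call.
# Two concurrent measure() calls on one metric object would race on
# self.score and self.reason.
results = []
for case in cases:
    metric.measure(case)
    results.append((metric.score, metric.reason))
```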

Describe the solution you'd like Would it be possible to make the metric UX stateless? For example:

class BaseMetric:
    # Does not store test-related values such as score and reason;
    # only stores metric-related values such as the evaluation model.
    evaluation_model: Optional[str] = None

...

    # LLMTestResult could be an extensible dict with fields such as score and reason.
    @abstractmethod
    def measure(self, test_case: LLMTestCase, *args, **kwargs) -> LLMTestResult:
        raise NotImplementedError
...
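To show the payoff, here is a self-contained sketch of the stateless design above. The `LLMTestCase`, `LLMTestResult`, and `LengthMetric` classes are hypothetical fill-ins (deepeval's real types differ); the point is that a side-effect-free `measure` lets every (metric, test_case) pair run in parallel:

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Hypothetical stand-ins for illustration only.
@dataclass(frozen=True)
class LLMTestCase:
    input: str

@dataclass(frozen=True)
class LLMTestResult:
    score: float
    reason: str

class BaseMetric(ABC):
    # Metric-level configuration only; no per-test-case state.
    evaluation_model: Optional[str] = None

    @abstractmethod
    def measure(self, test_case: LLMTestCase) -> LLMTestResult:
        raise NotImplementedError

class LengthMetric(BaseMetric):
    """Toy metric: scores by input length; stores nothing per test case."""
    def measure(self, test_case: LLMTestCase) -> LLMTestResult:
        score = min(len(test_case.input) / 10, 1.0)
        return LLMTestResult(score=score, reason=f"length={len(test_case.input)}")

cases = [LLMTestCase("hi"), LLMTestCase("hello world")]
metrics = [LengthMetric(), LengthMetric()]

# Because measure() returns its result instead of mutating the metric,
# all (metric, case) pairs can be evaluated concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda mc: mc[0].measure(mc[1]), product(metrics, cases)))
```

Each `LLMTestResult` carries its own score and reason, so nothing is lost when many pairs finish out of order.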

Additional context A fundamental UX update like this would certainly break backward compatibility, so I'd like to understand any concerns here. At the same time, I feel this could be a big opportunity, and thought it better to raise it sooner rather than later.

Peilun-Li avatar Mar 14 '24 00:03 Peilun-Li