Consideration behind a "stateful" metric UX
Is your feature request related to a problem? Please describe.
Hi, I was wondering what the considerations are behind choosing a "stateful" metric UX. Using the hallucination metric as an example (though this applies to other metrics as well):
metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
By "stateful" here I mean the metric
object itself stores the state (score, reason, etc.) for the last test_case
it has run against. I feel this might create unnecessary coupling between metrics and test cases? It could lead to side effects, for example, in bulk evaluation, we can not go full parallel across both the test_cases and metrics dimensions, instead we have to go one by one for each test_case, because otherwise we can't pull the reason
out which is stored in the metric
object.
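To illustrate the coupling concretely, here is a minimal sketch. `StatefulMetric` is a toy stand-in for the pattern described above, not deepeval's actual implementation; the "scoring" is fake and only exists to show the shared-state problem:

```python
from concurrent.futures import ThreadPoolExecutor

class StatefulMetric:
    """Toy stand-in for a stateful metric (not deepeval's real class):
    measure() stores its result on the metric object itself."""

    def measure(self, test_case: str) -> float:
        # Pretend scoring: the score is just derived from the input length.
        self.score = len(test_case) / 10
        return self.score

metric = StatefulMetric()
test_cases = ["short", "a much longer test case"]

# Running ONE metric object across test cases in parallel: each call
# overwrites metric.score, so after the map only the last writer's score
# survives on the object, and the per-test-case association is lost.
with ThreadPoolExecutor() as pool:
    returned = list(pool.map(metric.measure, test_cases))

# The return values are still correct per test case (map preserves order)...
print(returned)  # [0.5, 2.3]
# ...but metric.score holds only one of them, nondeterministically.
print(metric.score)
```

This is why callers must either serialize runs or copy the metric per test case when results live on the metric object.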
Describe the solution you'd like
Any thoughts on whether we could make the metric UX stateless? For example:
from abc import ABC, abstractmethod
from typing import Optional

class BaseMetric(ABC):
    # Do not store test-related values here, such as score and reason;
    # only store metric-related values, such as the evaluation model.
    evaluation_model: Optional[str] = None
    ...

    @abstractmethod
    def measure(self, test_case: LLMTestCase, *args, **kwargs) -> LLMTestResult:
        # LLMTestResult could be an extensible dict with fields
        # such as score and reason.
        raise NotImplementedError
    ...
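With results traveling on the return value rather than the metric object, the whole (metric × test_case) grid can be evaluated in parallel. A sketch of what that could look like, using a hypothetical `LLMTestResult` dataclass and a toy `StatelessMetric` (neither is deepeval's actual API):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product

@dataclass
class LLMTestResult:
    """Hypothetical result container returned by measure()."""
    score: float
    reason: str

class StatelessMetric:
    """Toy metric: measure() returns a result instead of mutating self."""

    def measure(self, test_case: str) -> LLMTestResult:
        score = len(test_case) / 10  # pretend scoring
        return LLMTestResult(score=score, reason=f"length={len(test_case)}")

metrics = [StatelessMetric(), StatelessMetric()]
test_cases = ["short", "longer input"]

# Fan out over every (metric, test_case) pair at once; each call is
# independent because no state is written back to the metric objects.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda pair: pair[0].measure(pair[1]),
                            product(metrics, test_cases)))

for r in results:
    print(r.score, r.reason)
```

Each result keeps its own score and reason, so nothing has to be read back off a shared metric object after the fact.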
Additional context
Making such a fundamental UX change would certainly mean breaking backward compatibility, so I'd like to understand any concerns here. That said, I feel this could also be a big opportunity, and thought it better to raise it sooner rather than later.