
Add JudgeLM as a way to evaluate and compare historical test runs

Open penguine-ip opened this issue 8 months ago • 3 comments

Currently, metrics are computed on the test cases run during evaluation, but there is no way to compare the performance of historical test runs other than manually comparing metric scores for each run.
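To illustrate the limitation, here is a rough sketch of what comparing historical runs looks like today: aggregate the per-test-case metric scores yourself and pick the run with the highest average. The dict shape below is just a stand-in for however DeepEval persists test runs, not its actual storage format.

```python
# Hypothetical stand-in for a stored test run: a list of per-test-case scores.
def average_score(test_run: dict) -> float:
    scores = [case["score"] for case in test_run["test_cases"]]
    return sum(scores) / len(scores)

run_a = {"test_cases": [{"score": 0.72}, {"score": 0.81}]}
run_b = {"test_cases": [{"score": 0.68}, {"score": 0.90}]}

# Today the "comparison" is just eyeballing aggregate metric scores like this.
best = max([run_a, run_b], key=average_score)
```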

I would like to introduce another way to choose the best configuration / hyperparameters and help devs speed up development with DeepEval: implement a JudgeLM (e.g. https://github.com/baaivision/JudgeLM) that picks the best test run out of a set of test runs. For example:

```python
from deepeval import judge

best_performing_test_run = None
for test_run in test_runs:
    best_performing_test_run = judge(test_run, best_performing_test_run)
```

This assumes the test runs were run on the same test cases / evaluation dataset; there should be error handling for edge cases that don't fit this criterion.
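For reference, here is a minimal sketch of what `judge` could look like internally, including the error handling mentioned above. Everything here is hypothetical: `TestRun`, its `outputs` attribute, and `llm_pairwise_preference` are made-up names, and a real implementation would plug into DeepEval's stored test runs and an actual JudgeLM-style model.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TestRun:
    id: str
    # Hypothetical shape: test case input -> actual output produced in this run.
    outputs: Dict[str, str]

def llm_pairwise_preference(prompt: str, output_a: str, output_b: str) -> int:
    # Placeholder: a real version would prompt a JudgeLM-style model and return
    # +1 if output_a is preferred, -1 if output_b is preferred, 0 for a tie.
    return 0

def judge(candidate: TestRun, incumbent: Optional[TestRun]) -> TestRun:
    # First iteration: nothing to compare against yet.
    if incumbent is None:
        return candidate
    # Edge case from the proposal: both runs must cover the same test cases.
    if set(candidate.outputs) != set(incumbent.outputs):
        raise ValueError("Test runs were not run on the same evaluation dataset")
    # Pairwise comparison per test case; keep whichever run wins overall.
    wins = 0
    for prompt, candidate_output in candidate.outputs.items():
        wins += llm_pairwise_preference(prompt, candidate_output, incumbent.outputs[prompt])
    return candidate if wins > 0 else incumbent
```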

penguine-ip · Oct 30 '23 13:10