Add JudgeLM as a way to evaluate and compare historical test runs
Currently, metrics are computed from test cases that run during evaluation, but the only way to compare historical test runs is to compare their metric scores run by run.
I would like to introduce another way to choose the best configurations / hyperparameters and help devs speed up development with DeepEval: implement a JudgeLM (e.g. https://github.com/baaivision/JudgeLM) that picks the best test run out of a set of test runs.
```python
from deepeval import judge  # proposed API, does not exist yet

best_performing_test_run = None
for test_run in test_runs:
    # judge() returns the better of the two runs; on the first
    # iteration it should simply return test_run.
    best_performing_test_run = judge(test_run, best_performing_test_run)
```
This assumes the test runs were run on the same test cases / evaluation dataset. There should be error handling for test runs that don't meet this criterion.
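For illustration, the error handling could look something like the sketch below. The `TestCase` / `TestRun` shapes and the `validate_comparable` helper are assumptions for this proposal, not the current deepeval API; the real implementation would hook into however deepeval stores historical test runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    # Hypothetical minimal test-case shape; a real test case
    # would also carry actual/expected outputs, context, etc.
    input: str

@dataclass
class TestRun:
    test_cases: list

def validate_comparable(test_runs):
    """Raise ValueError if the runs were not run on the same test cases."""
    if not test_runs:
        raise ValueError("No test runs provided.")
    reference = {tc.input for tc in test_runs[0].test_cases}
    for run in test_runs[1:]:
        if {tc.input for tc in run.test_cases} != reference:
            raise ValueError(
                "Test runs must share the same test cases / evaluation dataset."
            )
```

A set comparison ignores ordering, so two runs over the same dataset evaluated in a different order would still be judged comparable.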