distilabel
[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM)
The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.
Our current preference tasks are:
- UltraFeedback: it has different aspects (honest, helpful, etc.), so we need to decide how to compute the overall benchmark. We can start with the `for_text_quality()` aspect because it is a summary of the other aspects.
- JudgeLM
- UltraJudge: our own variation of UltraFeedback and JudgeLM
The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this, we can compute typical classification metrics (accuracy, precision, recall, F1).
This benchmark will be very useful, as we can run it whenever we develop or integrate new techniques.
- we can start with a sample of each dataset if it is large
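The comparison described above could be scored along these lines. This is only a sketch, assuming the task's chosen side and the benchmark's gold label are both encoded as binary labels (0 = response A chosen, 1 = response B chosen); `score_preferences` is a hypothetical helper, not part of distilabel.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def score_preferences(gold: list[int], predicted: list[int]) -> dict[str, float]:
    """Compare the task's chosen responses against the benchmark's gold labels.

    Both lists hold 0/1 flags indicating which of the two responses was chosen.
    """
    return {
        "accuracy": accuracy_score(gold, predicted),
        "precision": precision_score(gold, predicted, zero_division=0),
        "recall": recall_score(gold, predicted, zero_division=0),
        "f1": f1_score(gold, predicted, zero_division=0),
    }


# Toy example: the task disagrees with the benchmark on one of four pairs.
print(score_preferences(gold=[1, 0, 1, 1], predicted=[1, 0, 0, 1]))
```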
I ran the datasets for each of the tasks. From the HHH Alignment dataset I took only the "other" subset, and from MT Bench the first 100 questions. The MT Bench dataset also has a "tie" label, so it is more of a multiclass classification. As the LLM I used ChatGPT.
In general, HHH Alignment is labelled almost the same way as humans would label it, while MT Bench has very low agreement with humans. For now I only ran it to get metrics, and I plan to look into why exactly MT Bench is getting such low scores. Below are the results so far:
| Task | Dataset | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| for Text Quality | HHH Alignment | 0.930 | 0.930 | 1.000 | 0.964 |
| for Text Quality | MT Bench | 0.350 | 0.350 | 0.357 | 0.351 |
| UltraJudge | HHH Alignment | 0.744 | 0.744 | 1.000 | 0.853 |
| UltraJudge | MT Bench | 0.290 | 0.290 | 0.325 | 0.292 |
| JudgeLM | HHH Alignment | 0.767 | 0.767 | 1.000 | 0.868 |
| JudgeLM | MT Bench | 0.390 | 0.390 | 0.325 | 0.341 |
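Since MT Bench includes a "tie" label, the same metrics have to be computed in a multiclass setting. A minimal sketch, assuming a hypothetical 0/1/2 label encoding and sklearn's weighted averaging (the exact averaging used for the numbers above isn't specified in this thread):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical 3-way labels: 0 = response A wins, 1 = response B wins, 2 = tie.
gold = [0, 1, 2, 1, 0]
pred = [0, 2, 2, 1, 1]

acc = accuracy_score(gold, pred)
# Weighted averaging accounts for class imbalance (e.g. few "tie" examples).
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="weighted", zero_division=0
)
print(f"Accuracy: {acc:.3f}, Recall: {rec:.3f}, Precision: {prec:.3f}, F1: {f1:.3f}")
```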
This is awesome @zucchini-nlp!
I think the code to run the benchmark would be a great contribution to the repo!
Yes, I opened PR 131
@dvsrepo Close as completed?