
[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM)

Open dvsrepo opened this issue 1 year ago • 4 comments

The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.

Our current preference tasks are:

  • UltraFeedback: it has different aspects (honesty, helpfulness, etc.), so we need to decide how to compute the overall benchmark. We can start with the for_text_quality() aspect because it summarizes the other aspects.
  • JudgeLM
  • UltraJudge: our own variation of UltraFeedback and JudgeLM

The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this we can compute typical classification metrics (accuracy, precision, recall, F1); see the sketch after the list below.

This benchmark will be very useful as we can run it when we develop or integrate new techniques.

  • we can start with a sample of each dataset if it is large
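
A minimal sketch of what that evaluation loop could look like, assuming a hypothetical `rank_responses(prompt, responses)` wrapper around whichever preference task is being benchmarked (the helper name and the dataset fields are assumptions, not the actual distilabel API):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate_preference_task(rank_responses, examples):
    """Compare a preference task's choices against the benchmark labels.

    `rank_responses(prompt, responses)` is a hypothetical wrapper around a
    preference task (UltraFeedback, JudgeLM, UltraJudge) that returns the
    index of the response the LLM judge prefers. `examples` is an iterable
    of dicts with "prompt", "responses" and "chosen" (the index of the
    human-preferred response), e.g. rows from HHH Alignment.
    """
    y_true, y_pred = [], []
    for example in examples:
        y_true.append(example["chosen"])
        y_pred.append(rank_responses(example["prompt"], example["responses"]))

    accuracy = accuracy_score(y_true, y_pred)
    # Binary chosen/rejected case; MT Bench needs multiclass handling (ties).
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```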

dvsrepo avatar Nov 27 '23 08:11 dvsrepo

I ran the datasets for each of the tasks. From the HHH Alignment dataset I took only the "other" part, and from MT Bench the first 100 questions. MT Bench also has a "tie" label, so it is more of a multiclass classification problem. As the LLM I used ChatGPT.

In general, HHH Alignment is labelled almost the same way as humans would label it, while MT Bench has very low agreement with humans. For now I only ran the tasks to get metrics, and I plan to look into why exactly MT Bench is getting such low scores. Here are the results so far:

| Task | HHH Alignment | MT Bench |
| --- | --- | --- |
| UltraFeedback (text quality) | Accuracy: 0.930, Recall: 0.930, Precision: 1.000, F1: 0.964 | Accuracy: 0.350, Recall: 0.350, Precision: 0.357, F1: 0.351 |
| UltraJudge | Accuracy: 0.744, Recall: 0.744, Precision: 1.000, F1: 0.853 | Accuracy: 0.290, Recall: 0.290, Precision: 0.325, F1: 0.292 |
| JudgeLM | Accuracy: 0.767, Recall: 0.767, Precision: 1.000, F1: 0.868 | Accuracy: 0.390, Recall: 0.390, Precision: 0.325, F1: 0.341 |
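
Since MT Bench adds the "tie" class, a hedged sketch of how its metrics can be computed with weighted averaging over the three labels (the label strings here are just placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels per MT Bench question: which response was preferred.
human_labels = ["model_a", "tie", "model_b", "model_a", "model_b"]
judge_labels = ["model_a", "model_b", "model_b", "tie", "model_b"]

accuracy = accuracy_score(human_labels, judge_labels)
# Weighted averaging accounts for imbalance between the three classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, judge_labels, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.3f}, Recall: {recall:.3f}, "
      f"Precision: {precision:.3f}, F1: {f1:.3f}")
```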

zucchini-nlp avatar Nov 28 '23 20:11 zucchini-nlp

this is awesome @zucchini-nlp !

I think the code to run the benchmark would be a great contribution to the repo!

dvsrepo avatar Nov 29 '23 14:11 dvsrepo

Yes, I opened PR #131

zucchini-nlp avatar Nov 29 '23 18:11 zucchini-nlp

@dvsrepo Close as completed?

ashim-mahara avatar Aug 07 '24 21:08 ashim-mahara