
[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM)

Open dvsrepo opened this issue 1 year ago • 4 comments

The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.

Our current preference tasks are:

  • UltraFeedback: it has different aspects (honesty, helpfulness, etc.), so we need to decide how to compute the overall benchmark. We can start with the for_text_quality() aspect because it summarizes the other aspects.
  • JudgeLM
  • UltraJudge: our own variation of UltraFeedback and JudgeLM

The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this we can compute typical classification metrics (accuracy, precision, recall, F1); see the sketch after the list below.

This benchmark will be very useful as we can run it when we develop or integrate new techniques.

  • we can start with a sample of each dataset if it is large
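
A minimal sketch of what that evaluation loop could look like, assuming a hypothetical `rank_responses(prompt, responses)` wrapper around whichever preference task is being benchmarked (the helper name and the dataset fields are assumptions, not the actual distilabel API):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate_preference_task(rank_responses, examples):
    """Compare a preference task's choices against the benchmark labels.

    `rank_responses(prompt, responses)` is a hypothetical wrapper around a
    preference task (UltraFeedback, JudgeLM, UltraJudge) that returns the
    index of the response the LLM judge prefers. `examples` is an iterable
    of dicts with "prompt", "responses" and "chosen" (the index of the
    human-preferred response), e.g. rows from HHH Alignment.
    """
    y_true, y_pred = [], []
    for example in examples:
        y_true.append(example["chosen"])
        y_pred.append(rank_responses(example["prompt"], example["responses"]))

    accuracy = accuracy_score(y_true, y_pred)
    # Binary chosen/rejected case; MT Bench needs multiclass handling (ties).
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```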

dvsrepo avatar Nov 27 '23 08:11 dvsrepo

I ran the datasets for each of the tasks. From the HHH Alignment dataset I took only the "other" part, and from MT Bench the first 100 questions. MT Bench also has a "tie" label, so it is more of a multiclass classification problem. As the LLM I used ChatGPT.

In general, HHH Alignment is labelled almost the same way as humans would label it, while MT Bench has very low agreement with humans. For now I only ran the tasks to get metrics, and I plan to look into why exactly MT Bench is getting such low scores. Here are the results so far:

| Task | HHH Alignment | MT Bench |
| --- | --- | --- |
| UltraFeedback (text quality) | Accuracy: 0.930, Recall: 0.930, Precision: 1.000, F1: 0.964 | Accuracy: 0.350, Recall: 0.350, Precision: 0.357, F1: 0.351 |
| UltraJudge | Accuracy: 0.744, Recall: 0.744, Precision: 1.000, F1: 0.853 | Accuracy: 0.290, Recall: 0.290, Precision: 0.325, F1: 0.292 |
| JudgeLM | Accuracy: 0.767, Recall: 0.767, Precision: 1.000, F1: 0.868 | Accuracy: 0.390, Recall: 0.390, Precision: 0.325, F1: 0.341 |
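
Since MT Bench adds the "tie" class, a hedged sketch of how its metrics can be computed with weighted averaging over the three labels (the label strings here are just placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels per MT Bench question: which response was preferred.
human_labels = ["model_a", "tie", "model_b", "model_a", "model_b"]
judge_labels = ["model_a", "model_b", "model_b", "tie", "model_b"]

accuracy = accuracy_score(human_labels, judge_labels)
# Weighted averaging accounts for imbalance between the three classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, judge_labels, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.3f}, Recall: {recall:.3f}, "
      f"Precision: {precision:.3f}, F1: {f1:.3f}")
```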

zucchini-nlp avatar Nov 28 '23 20:11 zucchini-nlp

this is awesome @zucchini-nlp !

I think the code to run the benchmark would be a great contribution to the repo!

dvsrepo avatar Nov 29 '23 14:11 dvsrepo

Yes, I opened PR #131

zucchini-nlp avatar Nov 29 '23 18:11 zucchini-nlp

@dvsrepo Close as completed?

ashim-mahara avatar Aug 07 '24 21:08 ashim-mahara