Best of N benchmark
- Take a few chat models as the “base set”, say 1-3, e.g. Tulu 2 7B and Tulu 2 13B (maybe OLMo-Instruct)
- Generate ~8 completions per prompt in AlpacaEval (this is the held-out set)
- Use each RM to choose the best-of-N completion from that set, then run AlpacaEval on the selected outputs (see the sketch after this list)
- Score the delta in AlpacaEval win rate for each RM in the batch, on a fixed task (AlpacaEval) and fixed base model (Tulu)
- Could do this with MT-Bench, but the two-turn format is harder
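
A minimal sketch of the best-of-N selection step, assuming the ~8 completions per prompt are already generated and the RM loads as a standard `transformers` sequence-classification model that emits a single scalar reward. The checkpoint name, scoring interface, and data layout are placeholders, not how any specific RM in the batch is actually wired up:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "some-org/some-reward-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
model.eval()

def pick_best_of_n(prompt: str, completions: list[str]) -> str:
    """Score each completion with the RM and return the highest-scoring one."""
    scores = []
    for completion in completions:
        inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the reward is the single classification logit.
            scores.append(model(**inputs).logits[0, 0].item())
    return completions[scores.index(max(scores))]

# dataset: [{"instruction": ..., "completions": [... ~8 strings ...]}, ...]
# Dump the selected outputs in AlpacaEval's expected JSON format, score them with
# alpaca_eval, and compare the win rate to the base model's single-sample win rate.
```
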
Obvious flaws, but this seems WAY better than nothing.
@yuchenlin is starting this, woohoo!
Partially closed in #30, wrapping up soon.