
Best of N benchmark

Open natolambert opened this issue 1 year ago • 2 comments

  1. Take a few chat models as the “base set”, say 1-3, like tulu 2 7b and tulu 2 13b (maybe olmo-instruct)
  2. Generate ~8 completions per prompt in AlpacaEval (this is the heldout set)
  3. Use each RM to choose the single best completion (best-of-N) from that set, then run AlpacaEval on the selected outputs
  4. Score the delta for each RM in the batch on a set task (alpacaeval) and set base model (tulu)
  5. Could do this with MT-Bench, but the two-turn format is harder

Obvi flaws, but that seems WAY better than nothing.
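
The steps above can be sketched roughly as follows. This is a minimal illustration, not code from the repo: `RewardFn`, `best_of_n`, `select_outputs`, `score_delta`, and `toy_length_reward` are all hypothetical names, the toy reward stands in for a real RM, and the AlpacaEval run itself is left out (its scores are assumed to be computed separately and passed in as plain floats).

```python
from typing import Callable, Dict, List

# Hypothetical interface: a reward model maps (prompt, completion) to a scalar.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, completions: List[str], reward_fn: RewardFn) -> str:
    """Return the completion the reward model scores highest (ties: first wins)."""
    if not completions:
        raise ValueError("need at least one completion per prompt")
    return max(completions, key=lambda c: reward_fn(prompt, c))


def select_outputs(
    candidates: Dict[str, List[str]], reward_fn: RewardFn
) -> Dict[str, str]:
    """Pick one completion per prompt; the result is what gets sent to the eval."""
    return {p: best_of_n(p, cs, reward_fn) for p, cs in candidates.items()}


def score_delta(rm_eval_score: float, baseline_eval_score: float) -> float:
    """Delta between the eval score of RM-selected outputs and the base model's."""
    return rm_eval_score - baseline_eval_score


def toy_length_reward(prompt: str, completion: str) -> float:
    """Stand-in reward for illustration only: prefers longer completions."""
    return float(len(completion))


if __name__ == "__main__":
    # ~8 completions per prompt in the proposal; 3 here to keep the demo short.
    candidates = {
        "What is 2 + 2?": ["4", "It is 4.", "Four"],
        "Name a color.": ["Blue", "Red is a color.", "Green"],
    }
    for prompt, chosen in select_outputs(candidates, toy_length_reward).items():
        print(f"{prompt!r} -> {chosen!r}")
```

Swapping `toy_length_reward` for each RM under test, with the candidate set and base model held fixed, gives one delta per RM for comparison.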

natolambert, Feb 08 '24 02:02

@yuchenlin is starting this, woohoo!

natolambert, Feb 12 '24 22:02

Partially closed in #30, wrapping up soon.

natolambert, Feb 22 '24 20:02