Best of N benchmark
- Take a few chat models as the “base set”, say 1-3, e.g. Tulu 2 7B and Tulu 2 13B (maybe OLMo-Instruct)
- Generate ~8 completions per prompt in AlpacaEval (this is the held-out set)
- Use each RM to choose the best-of-N completion from that set, then run AlpacaEval on the selected outputs (see the sketch after this list)
- Score the delta in AlpacaEval win rate for each RM in the batch, on a fixed task (AlpacaEval) and fixed base model (Tulu)
- Could do this with MT-Bench, but the two-turn format is harder
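
A minimal sketch of the best-of-N selection step, assuming the ~8 completions per prompt are already generated and the RM loads as a standard `transformers` sequence-classification model that emits a single scalar reward. The checkpoint name, scoring interface, and data layout are placeholders, not how any specific RM in the batch is actually wired up:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "some-org/some-reward-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
model.eval()

def pick_best_of_n(prompt: str, completions: list[str]) -> str:
    """Score each completion with the RM and return the highest-scoring one."""
    scores = []
    for completion in completions:
        inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the reward is the single classification logit.
            scores.append(model(**inputs).logits[0, 0].item())
    return completions[scores.index(max(scores))]

# dataset: [{"instruction": ..., "completions": [... ~8 strings ...]}, ...]
# Dump the selected outputs in AlpacaEval's expected JSON format, score them with
# alpaca_eval, and compare the win rate to the base model's single-sample win rate.
```
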
Obvious flaws, but this seems WAY better than nothing.
@yuchenlin is starting this, woohoo!
Partially closed in #30, wrapping up soon.