Add Best of N sampling
What's the right place to add best of n sampling and compare its impact to some existing methods?
Some references:
- Discussed in the reward model scaling laws paper
- OpenAI blog post
- Used in WebGPT
That would be something you would do after training, right? So you would just take a model you want to evaluate, generate n candidates, and sort them with the reward model? We could easily do that in a notebook/script in examples. Or do you think it requires custom functionality in the library?
Yeah @lvwerra it would just be an example / documentation addition I bet. Or, a more advanced option would be to explain the differences a bit for people too.
Indeed, no additional training is required for this method, but you still need an optimization loop. One idea would be to wrap that logic in a helper class like BestOfNSampler, but starting with a notebook sounds like a good way to go!
Personally, I find the simplicity of the approach really appealing for setting non-PPO baselines :)
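To make the helper-class idea a bit more concrete, something along these lines could work. This is only a sketch: the class name, constructor arguments, and generate signature are illustrative rather than an existing API, and it assumes a decoder-only model plus a reward pipeline that returns a list of {label, score} dicts per text (as in the sentiment example).

import torch

class BestOfNSampler:
    # Illustrative sketch only, not an existing TRL API.
    def __init__(self, model, tokenizer, reward_pipe, n=4, generation_config=None, reward_pipe_kwargs=None):
        self.model = model
        self.tokenizer = tokenizer
        self.reward_pipe = reward_pipe
        self.n = n
        self.generation_config = generation_config
        self.reward_pipe_kwargs = reward_pipe_kwargs or {}

    def generate(self, query_text):
        # Encode the prompt once and repeat it n times so generate() samples n candidates.
        input_ids = self.tokenizer(query_text, return_tensors="pt").input_ids
        response_tensors = self.model.generate(
            input_ids.repeat(self.n, 1), generation_config=self.generation_config
        )
        # Drop the prompt tokens and decode only the generated continuations.
        responses = self.tokenizer.batch_decode(
            response_tensors[:, input_ids.shape[1]:], skip_special_tokens=True
        )
        # Score each (query, response) pair with the reward pipeline and keep the best candidate.
        pipe_outputs = self.reward_pipe(
            [query_text + "\n" + r for r in responses], **self.reward_pipe_kwargs
        )
        scores = torch.tensor([out[0]["score"] for out in pipe_outputs])
        return responses[scores.argmax().item()]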
@natolambert was wondering if this is up for grabs?
Haven't done it; happy to review your PR if you make one. Generally, I had written out pseudo code here:
import torch

# Sample batch_size candidate responses for the same prompt
query_tensors = [query_tensor] * batch_size
batch = {"query": [query_txt] * batch_size}  # query_txt: the decoded prompt text
response_tensors = model.generate(
    query_tensors,
    return_prompt=training_args.return_prompt,
    generation_config=generation_config,
)
batch["response"] = tokenizer.batch_decode(
    response_tensors, skip_special_tokens=training_args.decode_skip_special_tokens  # defaults to True
)

# Score each (query, response) pair with the reward model
texts = [q + "\n" + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = reward_pipe(texts, **reward_pipeline_kwargs)

# Collate the rewards; a baseline (e.g. the mean reward) could be subtracted for zero mean,
# but that would not change the argmax below
rewards = torch.tensor([output[0]["score"] for output in pipe_outputs])

# Keep the highest-reward response
best_output_idx = torch.argmax(rewards)
output = batch["response"][best_output_idx]
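One practical note on the generation_config: for the n candidates to actually differ, generation has to be stochastic; with greedy decoding all n responses come out identical and best-of-n degenerates to n=1. So the config would look something like the following (the specific values are placeholders, not a recommendation):

from transformers import GenerationConfig

# Sampling must be enabled so the n candidates are actually different.
generation_config = GenerationConfig(
    do_sample=True,
    top_k=0,
    top_p=1.0,
    max_new_tokens=64,
)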
@natolambert I have something up as a draft PR https://github.com/lvwerra/trl/pull/326 that implements your pseudo code
I haven't trained anything myself, but I pulled down models from huggingface_hub that are most likely trained with the sentiment notebook (in the examples folder). I may be misunderstanding things, but despite implementing it I have no clue what the zero-mean stuff is meant to showcase; that could well be the aftermath of having implemented the general best-of-n idea wrong.
FYI, I haven't trained anything since I'm not sure I'm on the right track, and putting training code in the PR is just a matter of copy-pasting from the notebook in the examples folder.
As an aside, and maybe I'm reading too much into this:
Context, from the conversation above:
That would be something you would do after training, right?
Indeed, no additional training is required for this method, but you still need an optimization loop
Aren't these two contradicting each other, or are they merely two different use cases of the best-of-n idea?
Ah, I just realized something. I guess I wasn't sure where exactly in the pipeline I was supposed to place the best-of-n sampling, and I placed it after the RL part.
From the WebGPT paper:
We sampled a fixed number of answers (4, 16 or 64) from either the BC model or the RL model (if left unspecified, we used the BC model),
I could be mistaken, but based on
I find the simplicity of the approach really appealing for setting non-PPO baselines :)
I think it makes sense to make my PR a ref vs PPO vs non-PPO (best-of-n) comparison.
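Roughly, reusing something like the BestOfNSampler sketched earlier in the thread (the function name, the n value, and reward_pipe_kwargs below are placeholders), the comparison would just be three evaluation passes over the same queries with the same reward model:

def compare_ref_ppo_bon(ref_model, ppo_model, tokenizer, reward_pipe, queries,
                        gen_cfg, n=16, reward_pipe_kwargs=None):
    reward_pipe_kwargs = reward_pipe_kwargs or {}
    # Three settings: the reference model with a single sample, the reference model
    # with best-of-n sampling, and the PPO-tuned model with a single sample.
    settings = {
        "ref (n=1)": (ref_model, 1),
        "ref + best-of-n": (ref_model, n),
        "ppo (n=1)": (ppo_model, 1),
    }
    results = {}
    for name, (model, num_samples) in settings.items():
        sampler = BestOfNSampler(
            model, tokenizer, reward_pipe,
            n=num_samples, generation_config=gen_cfg, reward_pipe_kwargs=reward_pipe_kwargs,
        )
        responses = [sampler.generate(q) for q in queries]
        # Score the selected response for every query and report the mean reward per setting.
        texts = [q + "\n" + r for q, r in zip(queries, responses)]
        outputs = reward_pipe(texts, **reward_pipe_kwargs)
        results[name] = sum(out[0]["score"] for out in outputs) / len(queries)
    return results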