Add Best of N sampling
What's the right place to add best of n sampling and compare its impact to some existing methods?
Some references:
- Discussed in the reward model scaling laws paper
- OpenAI blog post
- Used in WebGPT
That would be something you would do after training, right? So you would just take a model you want to evaluate, generate n candidates, and sort them with the reward model? We could easily do that in a notebook/script in examples. Or do you think it requires custom functionality in the library?
Yeah @lvwerra it would just be an example / documentation addition I bet. Or, a more advanced option would be to explain the differences a bit for people too.
Indeed, no additional training is required for this method, but you still need an optimization loop. One idea would be to wrap that logic in a helper class like BestOfNSampler, but starting with a notebook sounds like a good way to go!
Personally, I find the simplicity of the approach really appealing for setting non-PPO baselines :)
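To make the helper-class idea a bit more concrete, something along these lines could work. This is only a sketch: the class name, constructor arguments, and generate signature are illustrative rather than an existing API, and it assumes a decoder-only model plus a reward pipeline that returns a list of {label, score} dicts per text (as in the sentiment example).

import torch

class BestOfNSampler:
    # Illustrative sketch only, not an existing TRL API.
    def __init__(self, model, tokenizer, reward_pipe, n=4, generation_config=None, reward_pipe_kwargs=None):
        self.model = model
        self.tokenizer = tokenizer
        self.reward_pipe = reward_pipe
        self.n = n
        self.generation_config = generation_config
        self.reward_pipe_kwargs = reward_pipe_kwargs or {}

    def generate(self, query_text):
        # Encode the prompt once and repeat it n times so generate() samples n candidates.
        input_ids = self.tokenizer(query_text, return_tensors="pt").input_ids
        response_tensors = self.model.generate(
            input_ids.repeat(self.n, 1), generation_config=self.generation_config
        )
        # Drop the prompt tokens and decode only the generated continuations.
        responses = self.tokenizer.batch_decode(
            response_tensors[:, input_ids.shape[1]:], skip_special_tokens=True
        )
        # Score each (query, response) pair with the reward pipeline and keep the best candidate.
        pipe_outputs = self.reward_pipe(
            [query_text + "\n" + r for r in responses], **self.reward_pipe_kwargs
        )
        scores = torch.tensor([out[0]["score"] for out in pipe_outputs])
        return responses[scores.argmax().item()]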
@natolambert was wondering if this is up for grabs?
Haven't done it; happy to review your PR if you make one. Generally, I had written out pseudo code here:
import torch

# Sample batch_size candidate responses for the same prompt
query_tensors = [query_tensor] * batch_size
batch = {"query": [query_txt] * batch_size}  # query_txt: the decoded prompt text
response_tensors = model.generate(
    query_tensors,
    return_prompt=training_args.return_prompt,
    generation_config=generation_config,
)
batch["response"] = tokenizer.batch_decode(
    response_tensors, skip_special_tokens=training_args.decode_skip_special_tokens  # defaults to True
)

# Score each (query, response) pair with the reward model
texts = [q + "\n" + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = reward_pipe(texts, **reward_pipeline_kwargs)

# Collate the rewards; a baseline (e.g. the mean reward) could be subtracted for zero mean,
# but that would not change the argmax below
rewards = torch.tensor([output[0]["score"] for output in pipe_outputs])

# Keep the highest-reward response
best_output_idx = torch.argmax(rewards)
output = batch["response"][best_output_idx]
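One practical note on the generation_config: for the n candidates to actually differ, generation has to be stochastic; with greedy decoding all n responses come out identical and best-of-n degenerates to n=1. So the config would look something like the following (the specific values are placeholders, not a recommendation):

from transformers import GenerationConfig

# Sampling must be enabled so the n candidates are actually different.
generation_config = GenerationConfig(
    do_sample=True,
    top_k=0,
    top_p=1.0,
    max_new_tokens=64,
)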
@natolambert I have something up as a draft PR https://github.com/lvwerra/trl/pull/326 that implements your pseudo code
I haven't trained anything myself, but I pulled down models from huggingface_hub that are most likely trained with the sentiment notebook (in the examples folder). I may be misunderstanding things, but despite implementing it I have no clue what the zero-mean stuff is meant to showcase; that could well be the aftermath of having implemented the general best-of-n idea wrong.
FYI, I haven't trained anything since I'm not sure I'm on the right track, and putting training code in the PR is just a matter of copy-pasting from the notebook in the examples folder.
As an aside, and maybe I'm reading too much into this:
Context, from the conversation above:
That would be something you would do after training, right?
Indeed, no additional training is required for this method, but you still need an optimization loop
Aren't these two contradicting each other, or are they merely two different use cases of the best-of-n idea?
Ah, I just realized something. I guess I wasn't sure where exactly in the pipeline I was supposed to place the best-of-n sampling, and I placed it after the RL part.
From the WebGPT paper:
We sampled a fixed number of answers (4, 16 or 64) from either the BC model or the RL model (if left unspecified, we used the BC model),
I could be mistaken, but based on
I find the simplicity of the approach really appealing for setting non-PPO baselines :)
I think it makes sense to make my PR a ref vs PPO vs non-PPO (best-of-n) comparison.
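Roughly, reusing something like the BestOfNSampler sketched earlier in the thread (the function name, the n value, and reward_pipe_kwargs below are placeholders), the comparison would just be three evaluation passes over the same queries with the same reward model:

def compare_ref_ppo_bon(ref_model, ppo_model, tokenizer, reward_pipe, queries,
                        gen_cfg, n=16, reward_pipe_kwargs=None):
    reward_pipe_kwargs = reward_pipe_kwargs or {}
    # Three settings: the reference model with a single sample, the reference model
    # with best-of-n sampling, and the PPO-tuned model with a single sample.
    settings = {
        "ref (n=1)": (ref_model, 1),
        "ref + best-of-n": (ref_model, n),
        "ppo (n=1)": (ppo_model, 1),
    }
    results = {}
    for name, (model, num_samples) in settings.items():
        sampler = BestOfNSampler(
            model, tokenizer, reward_pipe,
            n=num_samples, generation_config=gen_cfg, reward_pipe_kwargs=reward_pipe_kwargs,
        )
        responses = [sampler.generate(q) for q in queries]
        # Score the selected response for every query and report the mean reward per setting.
        texts = [q + "\n" + r for q, r in zip(queries, responses)]
        outputs = reward_pipe(texts, **reward_pipe_kwargs)
        results[name] = sum(out[0]["score"] for out in outputs) / len(queries)
    return results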