FasterTransformer reward model support
🚀 The feature, motivation, and pitch
We need the ability to use massive reward models, as this will be necessary for our InstructGPT-style model. Currently the size of the reward model is severely limited, and using GPU accelerators for it comes with an awkward set of limitations.
Alternatives
We could alternatively use a separate accelerate script for the reward model, or include the reward model within the student class. The latter would be trivial, but it would result in messy code that is not easily extensible.
Additional context
No response
Couldn't you run RM on a separate node and send requests for comparison over the network?
You could! I think that's actually along the lines of the solutions we're looking at (also doing rollouts on a separate node for PPO). I think we want a very easy solution for the end user, though -- one where they don't really need to think about the size of their reward model if they have enough GPU horsepower.
Cool, not sure if it's a good fit but you could deploy the RM in Triton and invoke it via tritonclient. That's what I would do but perhaps it's not a good fit for your end users.
Yeah that's what we're doing internally.
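For anyone following along, here is a minimal sketch of querying a reward model served by Triton via tritonclient. The model name and the input/output tensor names ("reward_model", "TEXT", "REWARD") are assumptions for illustration and would have to match the deployed model config:

```python
# Minimal sketch of calling a Triton-deployed reward model over HTTP.
# "reward_model", "TEXT", and "REWARD" are assumed names; they must match the deployed config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Pack the text to score as a BYTES tensor of shape [batch, 1].
text = np.array([["Sample response to score"]], dtype=object)
inp = httpclient.InferInput("TEXT", list(text.shape), "BYTES")
inp.set_data_from_numpy(text)

out = httpclient.InferRequestedOutput("REWARD")
result = client.infer(model_name="reward_model", inputs=[inp], outputs=[out])
print(result.as_numpy("REWARD"))
```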
Could I ask what reward models you are using? Seems rare to find one.
> We need the ability to use massive reward models
I was using FLAN T5 11B zero shot. @Dahoas has multiple 6B finetuned RMs though.
> FLAN T5 11B
I've reviewed your code (or perhaps you have modified it since). I think the prompt format needs to change to handle multiline answers.
```python
# Give the model the prompt and two answers, and compare which is better
# by predicting the logits for "A" vs "B".
# Reference: https://yjernite.github.io/lfqa.html
special_token = "<P>"  # make sure this token exists in the tokenizer; it differs between models

def replaceTillNothingLeft(string: str, objective: str, target: str = "") -> str:
    # Repeatedly replace `objective` until no occurrence remains.
    while objective in string:
        string = string.replace(objective, target)
    return string

# Strip the special token from user-provided text (a shame lambdas cannot carry type hints).
RTNL = lambda x: replaceTillNothingLeft(x, special_token)

question = "What is the most beautiful thing in this world?"
ans_0 = "Frog."
ans_1 = "Cat."
mprompt = (
    f"Given the question and two answers, find the better answer.{special_token}"
    f"Question: {RTNL(question)}{special_token}"
    f"A: {RTNL(ans_0)}{special_token}"
    f"B: {RTNL(ans_1)}{special_token}"
    "Mark it as A or B."
)
```
I don't think so? It works fine without this.
Addressed by the Triton Inference Server client: https://github.com/CarperAI/trlx/tree/add-hh-example