Large reward model issue.
If the reward model cannot fit on a single GPU, which will be the case when we are training our InstructGPT-style model, then the current system fails, since you would have to run two Accelerate instances at once.
It might be possible to initialize the reward model separately with ZeRO-Inference [1]. AFAIK Accelerate by itself doesn't support a second DeepSpeed config. [1] https://www.deepspeed.ai/2022/09/09/zero-inference.html
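For what it's worth, a rough sketch of what that could look like, assuming the script is launched with the `deepspeed` launcher (e.g. `deepspeed --num_gpus 1 score_rm.py`); the model name, config values and the `score` helper are placeholders, not anything trlx provides:

```python
import deepspeed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# ZeRO-Inference: stage 3 partitioning with parameters offloaded to CPU,
# so a model that does not fit on one GPU can still be used for scoring.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    # deepspeed.initialize requires a batch size even for inference-only use
    "train_micro_batch_size_per_gpu": 1,
}

model_name = "my-org/11b-reward-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.float16
)

# A second, inference-only DeepSpeed engine, separate from whatever engine
# Accelerate builds for the policy model.
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

@torch.no_grad()
def score(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(engine.device)
    return engine(**batch).logits.squeeze(-1).tolist()
```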
This issue is now of a higher priority. I'm trying to get an 11b reward model working.
What is your 11b reward model? Seems rare to find one. #146
FLAN T5 11B
Hi, is there any update on this? Can also ask on discord
@Dahoas regularly uses 6b and 20b RMs. The key is to just set up a Triton server on its own GPU or set of GPUs. Closing since we're merging something related to this soon.
@LouisCastricato is there a guide or pointer on how to do this? Or has the change you're referring to been merged?
I'd also like to be able to share parameters between the policy and reward model (i.e. they're all the same base model w/ frozen layers), but that's a separate issue.
cc @reciprocated, who has some code to deal with large RMs
Hey @RobertKirk, for larger reward models you can adapt the code at https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L113-L183. With it you either host the reward model behind a Triton server's gRPC endpoint or dedicate a separate GPU on the same machine used in training (https://github.com/CarperAI/trlx/tree/main/examples/hh). Sharing parameters between the policy and reward model, however, is not supported yet.
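For anyone landing here later, the gRPC path from that example boils down to something like the sketch below (condensed, with placeholder names: the "reward_model" model name, the "input_ids"/"rewards" tensor names and the tokenizer all depend on how the RM is deployed on your Triton server):

```python
import numpy as np
import tritonclient.grpc as client_util
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer

triton_host = "localhost:8001"   # gRPC port of the Triton server holding the RM
triton_model = "reward_model"    # placeholder: name of the deployed model
client = client_util.InferenceServerClient(url=triton_host, verbose=False)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer


def prepare_tensor(name, array):
    # Wrap a numpy array as a Triton input tensor
    tensor = client_util.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    tensor.set_data_from_numpy(array)
    return tensor


def reward_fn(samples, prompts=None, outputs=None):
    # Tokenize the prompt+completion strings and let the remote reward model
    # score them; the training process only ever sees a list of floats.
    enc = tokenizer(samples, padding=True, truncation=True, max_length=1024)
    input_ids = np.array(enc.input_ids, dtype=np.int32)
    result = client.infer(triton_model, [prepare_tensor("input_ids", input_ids)])
    return result.as_numpy("rewards").flatten().tolist()
```

The resulting `reward_fn` is then just passed to `trlx.train(...)`, so the training process never has to load the RM weights itself.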
Thanks, that's very useful!