
Large reward model issue.

Open LouisCastricato opened this issue 2 years ago

If the reward model cannot fit on a single GPU, which will be the case when we train our InstructGPT model, the current system fails: you would have to run two Accelerate instances at once.

LouisCastricato avatar Oct 11 '22 21:10 LouisCastricato

It might be possible to initialize the reward model separately with ZeRO-Inference [1]. AFAIK Accelerate by itself doesn't support a second DeepSpeed config.

[1] https://www.deepspeed.ai/2022/09/09/zero-inference.html

maxreciprocate avatar Oct 11 '22 22:10 maxreciprocate
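
A minimal sketch of what the ZeRO-Inference route could look like, assuming a Hugging Face sequence-classification reward model; the checkpoint name and config values below are placeholders, not trlx code:

```python
# Hedged sketch: host the reward model with DeepSpeed ZeRO-Inference,
# separate from the Accelerate-managed policy. All names are placeholders.
import deepspeed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ds_config = {
    "train_batch_size": 1,  # required by deepspeed.initialize even for inference-only use
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-Inference: shard/offload parameters instead of replicating them
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

tokenizer = AutoTokenizer.from_pretrained("reward-model-checkpoint")
model = AutoModelForSequenceClassification.from_pretrained("reward-model-checkpoint")
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

@torch.no_grad()
def reward_fn(samples):
    # Score a batch of decoded samples with the offloaded reward model
    inputs = tokenizer(samples, return_tensors="pt", padding=True).to(engine.device)
    return engine(**inputs).logits.squeeze(-1)
```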

This issue is now of a higher priority. I'm trying to get an 11b reward model working.

LouisCastricato avatar Oct 31 '22 11:10 LouisCastricato

What is your 11b reward model? Seems rare to find one. #146

James4Ever0 avatar Dec 24 '22 01:12 James4Ever0

FLAN T5 11B

LouisCastricato avatar Dec 24 '22 02:12 LouisCastricato

Hi, is there any update on this? I can also ask on Discord.

marcobellagente93 avatar Jan 23 '23 20:01 marcobellagente93

@Dahoas regularly uses 6B and 20B RMs. The key is to just set up a Triton server on its own GPU or set of GPUs. Closing since we're merging something related to this soon.

LouisCastricato avatar Jan 23 '23 21:01 LouisCastricato
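
For reference, querying a Triton-hosted reward model over gRPC looks roughly like the following; the model name "reward_model" and the output tensor name "rewards" are illustrative assumptions, not trlx's exact code:

```python
# Hedged sketch: query a reward model served by Triton Inference Server over gRPC.
# Server URL, model name, and tensor names are assumptions.
import numpy as np
import tritonclient.grpc as client_util
from tritonclient.utils import np_to_triton_dtype

client = client_util.InferenceServerClient(url="localhost:8001", verbose=False)

def prepare_tensor(name: str, array: np.ndarray) -> client_util.InferInput:
    # Wrap a numpy array as a Triton input tensor
    tensor = client_util.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    tensor.set_data_from_numpy(array)
    return tensor

def get_rewards(input_ids: np.ndarray) -> np.ndarray:
    # One gRPC round trip per batch of tokenized samples
    result = client.infer("reward_model", [prepare_tensor("input_ids", input_ids)])
    return result.as_numpy("rewards")
```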

@LouisCastricato is there a guide or pointer on how to do this? Or has the change you're referring to been merged?

I'd also like to be able to share parameters between the policy and reward model (i.e. they're all the same base model w/ frozen layers), but that's a separate issue.

RobertKirk avatar Mar 21 '23 11:03 RobertKirk

cc @reciprocated, who has some code to deal with large RMs

LouisCastricato avatar Mar 21 '23 11:03 LouisCastricato

Hey @RobertKirk, for larger reward models you can adapt the code at https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L113-L183. With it you either host the reward model behind Triton server's gRPC endpoint or dedicate a separate GPU on the same machine used in training (https://github.com/CarperAI/trlx/tree/main/examples/hh). Sharing parameters between the policy and the reward model, however, is not supported yet.

maxreciprocate avatar Mar 22 '23 07:03 maxreciprocate
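
A rough sketch of the second option (a dedicated GPU on the training machine); the checkpoint name and the rank check are placeholders and simplified relative to ppo_hh.py:

```python
# Hedged sketch: keep the reward model on the last visible GPU of the main
# process, leaving the remaining GPUs to the Accelerate-managed policy.
# Checkpoint name and rank handling are placeholders, not ppo_hh.py verbatim.
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

if os.environ.get("RANK", "0") == "0":  # only the main process hosts the RM
    rm_device = torch.device(f"cuda:{torch.cuda.device_count() - 1}")
    tokenizer = AutoTokenizer.from_pretrained("reward-model-checkpoint")
    reward_model = (
        AutoModelForSequenceClassification.from_pretrained("reward-model-checkpoint")
        .half()
        .eval()
        .to(rm_device)
    )

    @torch.no_grad()
    def reward_fn(samples, **kwargs):
        # Tokenize decoded samples and score them on the dedicated GPU
        inputs = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(rm_device) for k, v in inputs.items()}
        return reward_model(**inputs).logits.squeeze(-1).cpu()
else:
    reward_fn = None  # other ranks receive rewards from the main process
```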

Thanks, that's very useful!

RobertKirk avatar Mar 28 '23 09:03 RobertKirk