
Large reward model issue.

Open LouisCastricato opened this issue 2 years ago

If the reward model cannot fit on a single GPU, which will be the case when we train our InstructGPT model, the current system fails: you would have to run two Accelerate instances at once.

LouisCastricato avatar Oct 11 '22 21:10 LouisCastricato

It might be possible to initialize the reward model separately with ZeRO-Inference [1]. AFAIK Accelerate by itself doesn't support a second DeepSpeed config.

[1] https://www.deepspeed.ai/2022/09/09/zero-inference.html

maxreciprocate avatar Oct 11 '22 22:10 maxreciprocate
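
A minimal sketch of what the ZeRO-Inference route could look like, assuming a Hugging Face sequence-classification reward model; the checkpoint name and config values below are placeholders, not trlx code:

```python
# Hedged sketch: host the reward model with DeepSpeed ZeRO-Inference,
# separate from the Accelerate-managed policy. All names are placeholders.
import deepspeed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ds_config = {
    "train_batch_size": 1,  # required by deepspeed.initialize even for inference-only use
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-Inference: shard/offload parameters instead of replicating them
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

tokenizer = AutoTokenizer.from_pretrained("reward-model-checkpoint")
model = AutoModelForSequenceClassification.from_pretrained("reward-model-checkpoint")
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

@torch.no_grad()
def reward_fn(samples):
    # Score a batch of decoded samples with the offloaded reward model
    inputs = tokenizer(samples, return_tensors="pt", padding=True).to(engine.device)
    return engine(**inputs).logits.squeeze(-1)
```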

This issue is now of a higher priority. I'm trying to get an 11b reward model working.

LouisCastricato avatar Oct 31 '22 11:10 LouisCastricato

What is your 11b reward model? Seems rare to find one. #146

James4Ever0 avatar Dec 24 '22 01:12 James4Ever0

FLAN T5 11B

LouisCastricato avatar Dec 24 '22 02:12 LouisCastricato

Hi, is there any update on this? I can also ask on Discord.

marcobellagente93 avatar Jan 23 '23 20:01 marcobellagente93

@Dahoas regularly uses 6B and 20B RMs. The key is to just set up a Triton server on its own GPU or set of GPUs. Closing since we're merging something related to this soon.

LouisCastricato avatar Jan 23 '23 21:01 LouisCastricato
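
For reference, querying a Triton-hosted reward model over gRPC looks roughly like the following; the model name "reward_model" and the output tensor name "rewards" are illustrative assumptions, not trlx's exact code:

```python
# Hedged sketch: query a reward model served by Triton Inference Server over gRPC.
# Server URL, model name, and tensor names are assumptions.
import numpy as np
import tritonclient.grpc as client_util
from tritonclient.utils import np_to_triton_dtype

client = client_util.InferenceServerClient(url="localhost:8001", verbose=False)

def prepare_tensor(name: str, array: np.ndarray) -> client_util.InferInput:
    # Wrap a numpy array as a Triton input tensor
    tensor = client_util.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    tensor.set_data_from_numpy(array)
    return tensor

def get_rewards(input_ids: np.ndarray) -> np.ndarray:
    # One gRPC round trip per batch of tokenized samples
    result = client.infer("reward_model", [prepare_tensor("input_ids", input_ids)])
    return result.as_numpy("rewards")
```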

@LouisCastricato is there a guide or pointer on how to do this? Or has the change you're referring to been merged?

I'd also like to be able to share parameters between the policy and reward model (i.e. they're all the same base model w/ frozen layers), but that's a separate issue.

RobertKirk avatar Mar 21 '23 11:03 RobertKirk

cc @reciprocated, who has some code to deal with large RMs

LouisCastricato avatar Mar 21 '23 11:03 LouisCastricato

Hey @RobertKirk, for larger reward models you can adapt the code at https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L113-L183. With it you either host the reward model behind Triton server's gRPC endpoint or dedicate a separate GPU on the same machine used in training (https://github.com/CarperAI/trlx/tree/main/examples/hh). Sharing parameters between the policy and the reward model, however, is not supported yet.

maxreciprocate avatar Mar 22 '23 07:03 maxreciprocate
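
A rough sketch of the second option (a dedicated GPU on the training machine); the checkpoint name and the rank check are placeholders and simplified relative to ppo_hh.py:

```python
# Hedged sketch: keep the reward model on the last visible GPU of the main
# process, leaving the remaining GPUs to the Accelerate-managed policy.
# Checkpoint name and rank handling are placeholders, not ppo_hh.py verbatim.
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

if os.environ.get("RANK", "0") == "0":  # only the main process hosts the RM
    rm_device = torch.device(f"cuda:{torch.cuda.device_count() - 1}")
    tokenizer = AutoTokenizer.from_pretrained("reward-model-checkpoint")
    reward_model = (
        AutoModelForSequenceClassification.from_pretrained("reward-model-checkpoint")
        .half()
        .eval()
        .to(rm_device)
    )

    @torch.no_grad()
    def reward_fn(samples, **kwargs):
        # Tokenize decoded samples and score them on the dedicated GPU
        inputs = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(rm_device) for k, v in inputs.items()}
        return reward_model(**inputs).logits.squeeze(-1).cpu()
else:
    reward_fn = None  # other ranks receive rewards from the main process
```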

Thanks, that's very useful!

RobertKirk avatar Mar 28 '23 09:03 RobertKirk