
[FEATURE]: GPU-memory-friendly PPO training for big models (larger than 2B)

Open yynil opened this issue 2 years ago • 1 comment

Describe the feature

PPO training needs to maintain four models in memory at the same time. The original implementation keeps the reward, actor, critic, and initial models all in GPU memory simultaneously. The actor/initial models' outputs are token IDs, which serve as actions for the reward/critic models. If the reward model and the actor model don't share the same tokenizer, those IDs mean nothing to the reward model.
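For context, here is a minimal sketch of the mismatch (checkpoint names are illustrative assumptions, not the code in my fork): the actor's generated IDs have to be decoded back to text and re-encoded with the reward model's tokenizer before scoring.

```python
# Sketch only: bridging two different tokenizers between actor and reward model.
# Checkpoint names are illustrative, not an actual training setup.
from transformers import AutoTokenizer

actor_tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
reward_tok = AutoTokenizer.from_pretrained("gpt2")  # a different tokenizer
reward_tok.pad_token = reward_tok.eos_token         # GPT-2 has no pad token

# `sequences` stands in for the actor's generated token IDs (prompt + response).
sequences = actor_tok("The quick brown fox", return_tensors="pt")["input_ids"]

# The raw IDs are meaningless to the reward model; decode with the actor's
# tokenizer and re-encode with the reward model's tokenizer instead.
texts = actor_tok.batch_decode(sequences, skip_special_tokens=True)
reward_inputs = reward_tok(texts, return_tensors="pt", padding=True)
```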

Even within the same model family, such as BLOOM, developers can't rely on the strong assumption that models of different scales share the same tokenizer. For example, bloom7b-mt doesn't necessarily share the same tokenizer with bloom-560m, as the check sketched below illustrates.
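Before pairing two checkpoints in PPO, one could at least verify the assumption. A minimal sketch (checkpoint names are examples only):

```python
# Sketch: verify that two checkpoints really share a tokenizer before pairing
# them in PPO. Checkpoint names below are examples, not a recommendation.
from transformers import AutoTokenizer

tok_small = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tok_large = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")

text = "PPO training with mismatched tokenizers"
ids_match = tok_small(text)["input_ids"] == tok_large(text)["input_ids"]
vocab_match = tok_small.get_vocab() == tok_large.get_vocab()
print(f"ids match: {ids_match}, vocabs match: {vocab_match}")
```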

Things get even worse if we only have one LLM, such as ChatGLM-6B. Then we don't even have a chance to bet that some smaller model shares the same tokenizer.

So a GPU-memory-friendly PPO trainer is needed, one that only keeps a single model in GPU memory at a time during training.
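As one possible shape of such a trainer (my sketch of the idea here, not the exact code in the fork): during experience collection, each frozen model is moved to the GPU only for its forward pass, then offloaded back to CPU.

```python
# Sketch: keep only one model resident in GPU memory at a time by offloading.
# Trades extra host<->device transfer time for a much smaller peak footprint.
import torch

def forward_offloaded(model, *args, **kwargs):
    """Move `model` to GPU, run a no-grad forward pass, then offload to CPU."""
    model.to("cuda")
    with torch.no_grad():
        out = model(*args, **kwargs)
    model.to("cpu")
    torch.cuda.empty_cache()  # return the model's cached blocks to the allocator
    return out

# During one PPO experience-collection step (tensors are placeholders):
#   values     = forward_offloaded(critic, sequences, attention_mask=mask)
#   rewards    = forward_offloaded(reward_model, reward_inputs["input_ids"])
#   ref_logits = forward_offloaded(initial_model, sequences).logits
```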

I have finished the code and the README doc in my fork. I'll submit a PR for this feature later.

yynil avatar Apr 14 '23 10:04 yynil

Hi @yynil, thank you very much for your proposal and contribution. Looking forward to your further PR updates. Thanks.

binmakeswell avatar Apr 18 '23 06:04 binmakeswell