[FEATURE]: Graphics card RAM friendly PPO training for big models (larger than 2B)
Describe the feature
The PPO training needs to maintain 4 models in memory at the same time. The original implementation keeps the reward, actor, critic, and initial models in video RAM simultaneously. The actor/initial models' outputs are token IDs, which serve as actions for the reward/critic models. If the reward model and the actor model don't share the same tokenizer, those IDs mean nothing to the reward model.
Even within the same model family, like BLOOM, developers can't rely on the strong assumption that models of different scales share the same tokenizer. For example, bloom7b-mt doesn't necessarily share the same tokenizer as bloom-560m.
Things get even worse if we only have one LLM, such as ChatGLM-6B. We don't even have the option of betting that a smaller model uses the same tokenizer.
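One way to bridge mismatched tokenizers is to round-trip the actor's outputs through text: decode the action IDs with the actor's tokenizer and re-encode them with the reward model's tokenizer. The sketch below illustrates the idea; the reward-model checkpoint name is a placeholder and this is not the actual API of this PR, just a minimal example of the re-tokenization step.

```python
# Hypothetical sketch: bridging mismatched tokenizers by round-tripping through text.
# The reward-model name below is a placeholder, not a real checkpoint.
import torch
from transformers import AutoTokenizer

actor_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
reward_tokenizer = AutoTokenizer.from_pretrained("some-org/reward-model")  # placeholder

def ids_for_reward_model(action_ids: torch.Tensor) -> torch.Tensor:
    # Decode the actor's token IDs back to plain text...
    texts = actor_tokenizer.batch_decode(action_ids, skip_special_tokens=True)
    # ...then re-encode the text with the reward model's own tokenizer.
    batch = reward_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return batch["input_ids"]
```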
A video-RAM-friendly PPO trainer is therefore needed, so that only one model has to be kept in video RAM at a time during training.
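A rough sketch of that idea, assuming plain PyTorch CPU offloading: each model is moved onto the GPU only for its forward pass and then returned to CPU, so at most one model occupies video RAM at any moment. The model objects and call signatures below are placeholders, not the interface of the actual PR.

```python
# Hypothetical sketch: keep only one model on the GPU at a time via CPU offload.
# `actor`, `initial_model`, `reward_model`, `critic` and their signatures are placeholders.
import contextlib
import torch

@contextlib.contextmanager
def on_gpu(model: torch.nn.Module, device: str = "cuda"):
    model.to(device)          # bring this model into video RAM
    try:
        yield model
    finally:
        model.to("cpu")       # evict it before the next model is loaded
        torch.cuda.empty_cache()

@torch.no_grad()
def collect_experience(prompt_ids, actor, initial_model, reward_model, critic):
    with on_gpu(actor) as m:
        action_ids = m.generate(prompt_ids.to("cuda"))
    with on_gpu(initial_model) as m:
        base_logits = m(action_ids).logits
    with on_gpu(reward_model) as m:
        rewards = m(action_ids)   # assumes a shared tokenizer; otherwise re-tokenize first
    with on_gpu(critic) as m:
        values = m(action_ids)
    return action_ids, base_logits, rewards, values
```

The trade-off is extra host-device transfer time per PPO step in exchange for a much smaller peak VRAM footprint.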
I have finished the code and the README doc in my fork. I'll submit a PR for this feature later.
Hi @yynil, thank you very much for your proposal and contribution. Looking forward to your further PR updates. Thanks.