llm_rlhf

Implements reinforcement learning training for LLMs such as GPT-2, LLaMA, and BLOOM.

Results 3 llm_rlhf issues

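The reward model is a scoring model. Could human scoring replace it? If so, would it be enough to build triples with their corresponding scores and then train the model with a reinforcement learning approach?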

Requirement already satisfied: py-cpuinfo in /home/liuhaiying/anaconda3/envs/ss/lib/python3.10/site-packages (from deepspeed->-r requirements.txt (line 3)) (9.0.0) Requirement already satisfied: pydantic-r requirements.txt (line 3)) (1.10.7) Requirement already satisfied: torch in /home/liuhaiying/anaconda3/envs/ss/lib/python3.10/site-packages (from deepspeed->-r requirements.txt (line...

Installing the environment

I have implemented QLoRA for SFT and the reward model, but I am quite confused about how to apply QLoRA to PPO. Do you plan to integrate PPO into the repo?