llm_rlhf
Implements reinforcement learning (RLHF) training for LLMs such as GPT-2, LLaMA, BLOOM, and other models.
The reward model is a scoring model — could it be replaced by human annotators? If so, would it be enough to assemble triples with corresponding human-assigned scores, and then train the model using the reinforcement learning approach?
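In principle, yes: the policy-gradient update only needs a scalar reward per (prompt, response) pair, regardless of whether that scalar comes from a learned reward model or a human label. Below is a minimal, hypothetical sketch (pure Python, not from this repo) where a dictionary of human scores stands in for the reward model and drives a toy REINFORCE update over two candidate responses; all names (`human_scores`, `logits`, the example prompt) are illustrative.

```python
import math
import random

random.seed(0)

prompt = "What is 2+2?"
responses = ["4", "5"]

# Human-assigned scores playing the role of reward_model(prompt, response).
human_scores = {
    (prompt, "4"): 1.0,   # preferred answer -> high reward
    (prompt, "5"): -1.0,  # dispreferred answer -> low reward
}

# Toy "policy": logits over the two candidate responses.
logits = [0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

lr = 0.5
for _ in range(50):
    probs = softmax(logits)
    # Sample a response from the current policy.
    i = random.choices(range(len(responses)), weights=probs)[0]
    r = human_scores[(prompt, responses[i])]
    # REINFORCE update: grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
    for j in range(len(logits)):
        logits[j] += lr * r * ((1.0 if j == i else 0.0) - probs[j])

probs = softmax(logits)
# After training, the policy strongly prefers the human-preferred answer "4".
```

The practical caveat is scale: PPO needs a reward for every sampled rollout during training, and humans cannot score fresh samples online, which is exactly why RLHF trains a reward model on a fixed set of human-labeled comparisons and then queries it instead.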
I have now implemented QLoRA for the SFT and reward models, but I am quite confused about how to apply QLoRA during PPO. Do you plan to integrate PPO into the repo?