OpenRLHF
An Easy-to-use, Scalable and High-performance RLHF Framework (supports 70B+ full tuning & LoRA & Mixtral & KTO)
Hi Team, while using the PPO pipeline we sometimes observe spikes in response length, and were curious whether any techniques related to a length penalty are available or have been explored
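For reference, a minimal sketch of one way such a penalty could be applied to per-sequence rewards before the PPO update; the helper and its coefficient are illustrative assumptions, not an existing OpenRLHF feature.

```python
import torch

def apply_length_penalty(rewards: torch.Tensor,
                         response_lengths: torch.Tensor,
                         max_len: int = 512,
                         penalty_coef: float = 0.01) -> torch.Tensor:
    # Hypothetical helper: subtract a penalty proportional to how far each
    # response exceeds max_len, leaving shorter responses untouched.
    excess = (response_lengths.float() - max_len).clamp(min=0.0)
    return rewards - penalty_coef * excess
```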
If I understand the current PPO code correctly, this instantiates completely separate actor and critic models, with no layers shared between them. (But correct me if that is wrong.)...
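For illustration only (not the actual OpenRLHF code), this is what "completely separate" means in practice: two independent `from_pretrained` calls, with a scalar value head on the critic; the base model name is a placeholder.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

pretrain = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Two independent loads, so no transformer layers are shared between the two.
actor = AutoModelForCausalLM.from_pretrained(pretrain)
critic_backbone = AutoModelForCausalLM.from_pretrained(pretrain)
value_head = nn.Linear(critic_backbone.config.hidden_size, 1)  # per-token scalar value
```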
I noticed that RemoteExperienceMaker left-pads the input sequences even when using vLLM for generation: https://github.com/OpenLLMAI/OpenRLHF/blob/dcd379a44eea56625626d1a0832cd3eeda048b21/openrlhf/trainer/ppo_utils/experience_maker.py#L346 I can see that a few lines down, `self.actor.process_sequences()` assumes this left-padding, as it calculates an...
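To make the assumption concrete, a small sketch of left-padding, assuming a known pad_token_id; code that indexes from the right edge of the batch (as the issue describes for `process_sequences`) relies on this layout.

```python
import torch

def left_pad(sequences, pad_token_id):
    # Pad every sequence on the left so the real tokens end at the same column.
    max_len = max(len(s) for s in sequences)
    padded = [[pad_token_id] * (max_len - len(s)) + list(s) for s in sequences]
    return torch.tensor(padded)

left_pad([[5, 6, 7], [8, 9]], pad_token_id=0)
# tensor([[5, 6, 7],
#         [0, 8, 9]])
```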
Hi, I notice you cite "70B+ Full Tuning with 16 A100"; however, this is also something that trlX supports (and that we worked very hard to add ;) ) via...
Hi! Thanks for your work on OpenRLHF. I trained a 4-bit Qwen-based reward model with this config (see the defaults):
```
parser.add_argument("--pretrain", type=str, default="Qwen/Qwen1.5-7B")
parser.add_argument('--dataset', type=str, default='Anthropic/hh-rlhf')
parser.add_argument("--dataset", type=str, default="nz/highest-number-rlhf")
...
```
Supports pip install and pre-built containers; all functions then support one-click training by passing args.
I have some thoughts about using vLLM for generation. Feel free to correct me if I'm wrong. 1. Batching: It seems that prompts are still passed to the vLLM engines...
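As a point of comparison, here is a hedged sketch of splitting a prompt batch evenly across several vLLM engines, assuming each engine exposes a `generate(prompts, sampling_params)` method (as `vllm.LLM` does); the `engines` list and chunking scheme are placeholders, not how OpenRLHF necessarily dispatches work.

```python
from vllm import SamplingParams

def generate_distributed(engines, prompts, sampling_params: SamplingParams):
    # Split prompts into roughly equal chunks, one chunk per engine.
    chunk_size = (len(prompts) + len(engines) - 1) // len(engines)
    outputs = []
    for i, engine in enumerate(engines):
        chunk = prompts[i * chunk_size:(i + 1) * chunk_size]
        if chunk:
            outputs.extend(engine.generate(chunk, sampling_params))
    return outputs
```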