OpenRLHF
An Easy-to-use, Scalable and High-performance RLHF Framework (supports 70B+ full tuning & LoRA & Mixtral & KTO)
Hi Team, while using the PPO pipeline we sometimes observe spikes in response length, and were curious whether any techniques related to a length penalty are available or have been explored
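For reference, a minimal sketch of one way such a penalty could be applied to per-sequence rewards before the PPO update; the helper and its coefficient are illustrative assumptions, not an existing OpenRLHF feature.

```python
import torch

def apply_length_penalty(rewards: torch.Tensor,
                         response_lengths: torch.Tensor,
                         max_len: int = 512,
                         penalty_coef: float = 0.01) -> torch.Tensor:
    # Hypothetical helper: subtract a penalty proportional to how far each
    # response exceeds max_len, leaving shorter responses untouched.
    excess = (response_lengths.float() - max_len).clamp(min=0.0)
    return rewards - penalty_coef * excess
```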
If I understand the current PPO code correctly, this instantiates completely separate actor and critic models, with no layers shared between them. (But correct me if that is wrong.)...
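For illustration only (not the actual OpenRLHF code), this is what "completely separate" means in practice: two independent `from_pretrained` calls, with a scalar value head on the critic; the base model name is a placeholder.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

pretrain = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Two independent loads, so no transformer layers are shared between the two.
actor = AutoModelForCausalLM.from_pretrained(pretrain)
critic_backbone = AutoModelForCausalLM.from_pretrained(pretrain)
value_head = nn.Linear(critic_backbone.config.hidden_size, 1)  # per-token scalar value
```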
I noticed that RemoteExperienceMaker left-pads the input sequences even when using vLLM for generation: https://github.com/OpenLLMAI/OpenRLHF/blob/dcd379a44eea56625626d1a0832cd3eeda048b21/openrlhf/trainer/ppo_utils/experience_maker.py#L346 I can see that a few lines down, `self.actor.process_sequences()` assumes this left-padding, as it calculates an...
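To make the assumption concrete, a small sketch of left-padding, assuming a known pad_token_id; code that indexes from the right edge of the batch (as the issue describes for `process_sequences`) relies on this layout.

```python
import torch

def left_pad(sequences, pad_token_id):
    # Pad every sequence on the left so the real tokens end at the same column.
    max_len = max(len(s) for s in sequences)
    padded = [[pad_token_id] * (max_len - len(s)) + list(s) for s in sequences]
    return torch.tensor(padded)

left_pad([[5, 6, 7], [8, 9]], pad_token_id=0)
# tensor([[5, 6, 7],
#         [0, 8, 9]])
```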
Hi, I notice you cite "70B+ Full Tuning with 16 A100"; however, this is also something that trlX supports (and that we worked very hard to add ;) ) via...
Hi! Thanks for your work on OpenRLHF. I trained a 4-bit Qwen-based reward model with this config (see the defaults):
```
parser.add_argument("--pretrain", type=str, default="Qwen/Qwen1.5-7B")
parser.add_argument('--dataset', type=str, default='Anthropic/hh-rlhf')
parser.add_argument("--dataset", type=str, default="nz/highest-number-rlhf")
...
```
Supports pip install and pre-built containers; all functions then support one-click training by passing args.
I have some thoughts about using vLLM for generation. Feel free to correct me if I'm wrong. 1. Batching: It seems that prompts are still passed to the vLLM engines...
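As a point of comparison, here is a hedged sketch of splitting a prompt batch evenly across several vLLM engines, assuming each engine exposes a `generate(prompts, sampling_params)` method (as `vllm.LLM` does); the `engines` list and chunking scheme are placeholders, not how OpenRLHF necessarily dispatches work.

```python
from vllm import SamplingParams

def generate_distributed(engines, prompts, sampling_params: SamplingParams):
    # Split prompts into roughly equal chunks, one chunk per engine.
    chunk_size = (len(prompts) + len(engines) - 1) // len(engines)
    outputs = []
    for i, engine in enumerate(engines):
        chunk = prompts[i * chunk_size:(i + 1) * chunk_size]
        if chunk:
            outputs.extend(engine.generate(chunk, sampling_params))
    return outputs
```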