Gym
Docs + Environment pattern: RLHF
Use cases, pain points, and background
Description:
Design: We probably need a generic reward model client that serves as shared infrastructure for all RLHF environments.
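A rough sketch of what such a shared client could look like, to make the design discussion concrete. Everything here is hypothetical and not part of Gym today: the names `ScoreRequest`, `RewardModelClient`, `CallableRewardModelClient`, and `compute_rewards` are invented for illustration, and the toy `length_score` function only stands in for a real reward model call (for example, a request to a locally served model).

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


@dataclass(frozen=True)
class ScoreRequest:
    """One (prompt, response) pair to be scored. Hypothetical type."""
    prompt: str
    response: str


class RewardModelClient(Protocol):
    """Hypothetical interface every RLHF environment would depend on."""

    def score(self, requests: Sequence[ScoreRequest]) -> list[float]:
        ...


class CallableRewardModelClient:
    """Adapts any scoring function to the shared interface.

    In a real setup the function would call the reward model backend;
    that backend is deliberately left out of this sketch.
    """

    def __init__(self, score_fn: Callable[[str, str], float]) -> None:
        self._score_fn = score_fn

    def score(self, requests: Sequence[ScoreRequest]) -> list[float]:
        # Score each pair independently and return rewards in input order.
        return [float(self._score_fn(r.prompt, r.response)) for r in requests]


def compute_rewards(
    client: RewardModelClient, prompt: str, responses: Sequence[str]
) -> list[float]:
    """Environment-side helper: score several candidate responses to one prompt."""
    return client.score([ScoreRequest(prompt, r) for r in responses])


def length_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: counts whitespace-separated words."""
    return float(len(response.split()))


if __name__ == "__main__":
    client = CallableRewardModelClient(length_score)
    print(compute_rewards(client, "Explain RLHF.", ["short answer", "a longer answer here"]))
```

Environments would depend only on the `RewardModelClient` interface, so swapping the toy scorer for a real reward model backend should not require changes to environment code.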
Out of scope:
Acceptance Criteria:
- [ ] Gym spins up a reward model locally, as in the local vLLM model flow
- [ ] Replicate the current Nemotron RLHF process