[RLlib][DO NOT MERGE] PPO RLHF Example
Why are these changes needed?
This PR is an attempt to create an RLHF pipeline example in RLlib. Doing this will achieve a couple of things:
- Evaluate whether our RLModule / Learner APIs are simple to use, robust, and sufficient for this application.
- Put a spotlight on other shortcomings of RLlib that need to be fixed (e.g., sampler issues).
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
When the `run.py` script is executed, the training log shows that the rewards are "nan", as seen below:
+-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| PPORLHF_RLHFEnv_71501_00000 | RUNNING | 10.0.0.156:166214 | 2 | 106.726 | 4 | nan | nan | nan | nan |
+-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
The reason for this problem is the lack of evaluation. Maybe you can add an `.evaluation()` config to the `PPOConfig` and a custom `evaluate()` function to the `PPORLHF` class.
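A minimal sketch of this suggested workaround, assuming RLlib's standard `AlgorithmConfig.evaluation()` parameters (`evaluation_interval`, `evaluation_duration`). The `PPORLHF` subclass and its `evaluate()` body are hypothetical placeholders standing in for the PR's actual algorithm class, and `"CartPole-v1"` stands in for the PR's `RLHFEnv`:

```python
from ray.rllib.algorithms.ppo import PPO, PPOConfig


# Hypothetical stand-in for the PR's PPORLHF algorithm class.
class PPORLHF(PPO):
    def evaluate(self):
        # Custom evaluation hook. A real implementation would roll the
        # policy out against the RLHF environment / reward model and
        # aggregate episode rewards, so the trial table no longer
        # reports "nan" for the reward columns.
        return {"evaluation": {"episode_reward_mean": 0.0}}


config = (
    PPOConfig()
    .environment("CartPole-v1")  # stand-in for the PR's RLHFEnv
    # Run evaluation every training iteration, for 10 episodes each time,
    # so reward metrics get populated in the results.
    .evaluation(evaluation_interval=1, evaluation_duration=10)
)
```

This is a config sketch only; whether it resolves the "nan" rewards depends on how `PPORLHF` collects and reports episode metrics.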
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hello, could you please let me know the reason behind the closure of this PR? I'm planning to apply RLHF with Ray in my project and would appreciate any feedback on why the example didn't merge. Thank you for your help.
This PR was more of a prototype at the time and probably won't scale to super large models. Also, RLHF does not have much demand in the market, hence I'm closing this PR.
Thank you for your response. Do you mean RLHF can function properly in Ray RLlib?