[RLlib][DO NOT MERGE] PPO RLHF Example
Why are these changes needed?
This PR is an attempt to create an RLHF pipeline example in RLlib. Doing this will achieve a couple of things:
- Evaluate whether our RLModule / Learner APIs are simple to use, robust, and sufficient for this application.
- Put a spotlight on other shortcomings of RLlib that need to be fixed (e.g., sampler issues).
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
When the `run.py` script is executed, the training log shows that the rewards are "nan", as seen below:
+-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| PPORLHF_RLHFEnv_71501_00000 | RUNNING | 10.0.0.156:166214 | 2 | 106.726 | 4 | nan | nan | nan | nan |
+-----------------------------+----------+-------------------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
The reason for this problem is the lack of evaluation. Maybe you can add an `.evaluation()` config to the `PPOConfig` and a custom `evaluate()` function to the `PPORLHF` class.
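A minimal sketch of this suggested workaround, assuming RLlib's standard `AlgorithmConfig.evaluation()` parameters (`evaluation_interval`, `evaluation_duration`). The `PPORLHF` subclass and its `evaluate()` body are hypothetical placeholders standing in for the PR's actual algorithm class, and `"CartPole-v1"` stands in for the PR's `RLHFEnv`:

```python
from ray.rllib.algorithms.ppo import PPO, PPOConfig


# Hypothetical stand-in for the PR's PPORLHF algorithm class.
class PPORLHF(PPO):
    def evaluate(self):
        # Custom evaluation hook. A real implementation would roll the
        # policy out against the RLHF environment / reward model and
        # aggregate episode rewards, so the trial table no longer
        # reports "nan" for the reward columns.
        return {"evaluation": {"episode_reward_mean": 0.0}}


config = (
    PPOConfig()
    .environment("CartPole-v1")  # stand-in for the PR's RLHFEnv
    # Run evaluation every training iteration, for 10 episodes each time,
    # so reward metrics get populated in the results.
    .evaluation(evaluation_interval=1, evaluation_duration=10)
)
```

This is a config sketch only; whether it resolves the "nan" rewards depends on how `PPORLHF` collects and reports episode metrics.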
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hello, could you please let me know the reason behind the closure of this PR? I'm planning to apply RLHF with Ray in my project and would appreciate any feedback on why the example didn't merge. Thank you for your help.
This PR was more of a prototype at the time and probably won't scale to super large models. Also, RLHF does not have much demand in the market, hence I'm closing this PR.
Thank you for your response. Do you mean RLHF can function properly in Ray RLlib?