
Adding Average Reward PPO proposal

Open · Howuhh opened this issue 2 years ago · 3 comments

Although it is now common to solve most problems by optimizing the discounted reward, this does not always match the real problem (non-episodic, long-horizon settings), where it is important to use algorithms that optimize the average reward instead.
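
For reference, these are the standard definitions of the two objectives (added here for clarity, not from the issue itself): the discounted objective and the average-reward objective,

$$J_\gamma(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad \rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\,\mathbb{E}_\pi\left[\sum_{t=0}^{N-1} r_t\right].$$

In continuing tasks $\rho(\pi)$ weights all time steps equally, whereas $J_\gamma(\pi)$ implicitly down-weights long-run behavior.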

There are only two adaptations of modern algorithms to the average-reward setting: A-TRPO and A-PPO. A-PPO is almost identical to regular PPO and much easier to implement than A-TRPO. Besides, it also resolves some known issues [1, 2] with regular PPO (e.g. that it samples from the undiscounted state distribution).
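
To give a rough idea of how small the change is, here is a minimal sketch of the advantage computation in the average-reward setting: GAE with rewards centered by a running estimate of the average reward and no discounting. The names (`rho`, `average_reward_advantages`, `update_rho`) are illustrative, not taken from the paper's or cleanrl's code:

```python
# Minimal sketch of the core change from PPO to average-reward PPO:
# rewards are centered by a running estimate of the average reward `rho`,
# and advantage targets are computed without discounting (gamma = 1).
import numpy as np

def average_reward_advantages(rewards, values, next_value, dones,
                              rho, gae_lambda=0.95):
    """GAE-style advantages for the average-reward setting.

    rewards, values, dones: np.ndarrays of shape (num_steps,)
    next_value: bootstrap value (bias estimate) for the last state
    rho: current estimate of the average reward per step
    """
    num_steps = len(rewards)
    advantages = np.zeros(num_steps)
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        v_next = next_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # TD error with the average reward subtracted and no discount factor
        delta = rewards[t] - rho + v_next * not_done - values[t]
        last_gae = delta + gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # targets for the bias-function critic
    return advantages, returns

# `rho` itself can be tracked, e.g., as an exponential moving average of
# the batch mean reward (one of several possible estimators):
def update_rho(rho, batch_rewards, step_size=0.01):
    return rho + step_size * (np.mean(batch_rewards) - rho)
```

The clipped surrogate objective and the rest of the PPO training loop would stay unchanged; only the advantage targets differ.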

In my free time, I tried to reproduce the results of the paper, and it kinda worked. Nevertheless, I think it is important to compare it with other algorithms, especially PPO. This will be easier to do if they share the same code base and common hacks, so adding it to cleanrl would help with that. In addition, I think the average-reward setting is very underrated, and this would help popularize it (if A-PPO really works).

Howuhh · Jun 20 '22 15:06

This looks pretty interesting: APPO would be a new algorithm, and it would be great to have it. Would you be up for going through the new algorithm contribution checklist? See https://github.com/vwxyzjn/cleanrl/pull/186 as an example.

vwxyzjn · Jun 20 '22 15:06

Yes, but it will take some time, especially the documentation and testing. I think it would be reasonable to start with apo_continuous_action.py and compare it only against ppo_continuous_action.py as a test.

Howuhh · Jun 20 '22 15:06

That makes sense. Thank you!

vwxyzjn · Jun 20 '22 15:06