
PPOAgent Entropy Regularization, Clipping, and GAE Are Working Incorrectly

Open · kochlisGit opened this issue 2 years ago · 3 comments

I have been trying to implement a PPO agent that solves LunarLander-v2, as in the official example in the GitHub repo: https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ppo/examples/v2/train_eval_clip_agent.py

In this example, a PPOClipAgent is used. However, I would like to use both clipping and the KL penalty, so I used the PPOAgent class, which provides both options according to the documentation here: https://www.tensorflow.org/agents/api_docs/python/tf_agents/agents/PPOAgent

As you may notice, the KL-penalty parameters already default to the values selected in the original paper. However, importance_ratio_clipping (clipping), entropy_regularization (entropy coefficient), and use_gae (generalized advantage estimation) are disabled by default (set to 0 / False).

I left the rest of the parameters as they are and made the following changes (see the sketch below):

importance_ratio_clipping=0.3
entropy_regularization=0.01
use_gae=True
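
For concreteness, here is a minimal sketch of the kind of setup I mean (the network sizes, optimizer, and learning rate are illustrative placeholders, not my exact values; everything else is left at the documented defaults):

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network, value_network

# Load LunarLander-v2 and wrap it as a TF environment.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('LunarLander-v2'))

# Illustrative networks; layer sizes are placeholders.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(64, 64))
value_net = value_network.ValueNetwork(
    env.observation_spec(), fc_layer_params=(64, 64))

agent = ppo_agent.PPOAgent(
    env.time_step_spec(),
    env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    actor_net=actor_net,
    value_net=value_net,
    # The three parameters I changed from their defaults:
    importance_ratio_clipping=0.3,
    entropy_regularization=0.01,
    use_gae=True,
    # All KL-penalty parameters are left at their documented defaults.
)
agent.initialize()
```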

While the original PPOAgent, without changing those parameters, works perfectly, when I change one of those parameters or all of them, the agent diverges quickly and always gets negative rewards, no matter how long I train it. I ran those experiments many times to see if I could get a better result, but the algorithm always performed very poorly.

To test whether this is a bug in the tf-agents PPOAgent class, I ran the same algorithm with the same parameters using RLlib. I also changed the remaining parameters so that they match the tf-agents defaults. Surprisingly, their implementation has no problem converging while using clipping, the KL penalty, the entropy coefficient, and GAE at the same time! Here are the results:

https://github.com/kochlisGit/DRL-Frameworks/blob/main/rllib/ppo_average_return.png
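
Roughly, the RLlib run looked like the sketch below (the values shown are the ones described above, everything else was left at RLlib defaults, and the config-dict API is the one from the Ray version I used at the time; newer RLlib releases use a different API):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

config = {
    "env": "LunarLander-v2",
    "framework": "tf",
    # Same settings I enabled in tf-agents:
    "clip_param": 0.3,       # importance ratio clipping
    "entropy_coeff": 0.01,   # entropy regularization
    "use_gae": True,
    "lambda": 0.95,
    # Adaptive KL penalty, active alongside clipping:
    "kl_coeff": 1.0,
    "kl_target": 0.01,
}

trainer = PPOTrainer(config=config)
for _ in range(200):  # the iteration count here is arbitrary
    result = trainer.train()
    print(result["episode_reward_mean"])
```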

kochlisGit commented Nov 25 '21 09:11

Thank you for raising this issue and for the detailed description of the problem!

Could you please try this with our new examples https://github.com/tensorflow/agents/blob/master/tf_agents/examples/ppo/schulman17/ppo_clip_train_eval.py

The new versions of the examples are tested nightly and verified against the numbers reported in the paper, so they are more reliable.

Once you try the new example, could you verify: (1) whether you get the expected learning with the schulman17 parameters and just clipping, and (2) whether the agent stops learning once you add the KL terms (roughly the comparison sketched below)?
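
To be concrete, the comparison I have in mind is roughly the following (just a sketch; it assumes the specs, networks, and optimizer are already built as in the schulman17 example, and the KL values spelled out are the PPOAgent documented defaults):

```python
from tf_agents.agents.ppo import ppo_agent, ppo_clip_agent

# Step 1: clipping only, matching the schulman17 example.
clip_agent = ppo_clip_agent.PPOClipAgent(
    time_step_spec, action_spec,
    optimizer=optimizer,
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.2,
    use_gae=True,
)

# Step 2: the same hyperparameters, but via PPOAgent so that the
# adaptive KL penalty is also active.
kl_agent = ppo_agent.PPOAgent(
    time_step_spec, action_spec,
    optimizer=optimizer,
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.2,
    use_gae=True,
    initial_adaptive_kl_beta=1.0,
    adaptive_kl_target=0.01,
    adaptive_kl_tolerance=0.3,
)
```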

That will help us narrow down where the issue is. My guess is that something in the KL-related implementation might be scaled differently; that logic is less widely used than the clipping version. Thanks in advance!

summer-yue commented Dec 08 '21 21:12

In the meantime, our team will look into pointing users towards the new examples more clearly, rather than the older ones. Thanks again.

summer-yue commented Dec 08 '21 21:12

I have tested PPOClipAgent on LunarLander-v2 three times using the example https://github.com/tensorflow/agents/blob/master/tf_agents/examples/ppo/schulman17/ppo_clip_train_eval.py

It works fine: the agent learns and converges quickly. Then I added entropy_regularization=0.01, which did not change much in the training process (use_gae was already True by default). To use the KL parameters, I had to switch from PPOClipAgent to PPOAgent and set importance_ratio_clipping=0.2. That did not produce the expected returns.

kochlisGit commented Dec 10 '21 09:12