stable-baselines3-contrib
[Feature Request] MaskableRecurrentPPO
Motivation MaskablePPO is great for large discrete action spaces with many invalid actions at each step, while RecurrentPPO gives the agent a memory of previous observations and actions, which improves its decision-making. Right now, we have to choose between these two algorithms and cannot have the features of both, which would greatly improve agent training when both action masking and sequence processing are helpful.
Feature MaskableRecurrentPPO - an algorithm that combines MaskablePPO and RecurrentPPO, or action masking integrated into PPO and RecurrentPPO.
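For illustration, a hypothetical usage sketch of the requested algorithm; the `MaskableRecurrentPPO` class does not exist yet, and the combined `predict` signature shown here (LSTM state plus `action_masks`) is an assumption that simply mirrors the existing MaskablePPO and RecurrentPPO APIs:

```python
# Hypothetical API sketch: MaskableRecurrentPPO is the requested feature, not an
# existing sb3-contrib class. It is assumed to take an LSTM policy like RecurrentPPO
# and to read action masks from the env like MaskablePPO.
import numpy as np
from sb3_contrib.common.maskable.utils import get_action_masks

env = ...  # a VecEnv whose underlying env implements action_masks()
model = MaskableRecurrentPPO("MlpLstmPolicy", env, verbose=1)  # assumed class name
model.learn(total_timesteps=100_000)

obs = env.reset()
lstm_states = None
episode_starts = np.ones((env.num_envs,), dtype=bool)
for _ in range(1_000):
    action, lstm_states = model.predict(
        obs,
        state=lstm_states,
        episode_start=episode_starts,
        action_masks=get_action_masks(env),  # assumed: merged predict signature
        deterministic=True,
    )
    obs, rewards, dones, infos = env.step(action)
    episode_starts = dones
```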
Duplicate of https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/76
@dylanprins I would be happy to share the link in the doc if you could open source your implementation ;)
@dylanprins +1 If anyone has a solution for it, that would be really great :)
I would love to try this implementation if it is available! MaskablePPO is the only algorithm that makes use of the masking technique :+1:
@rllyryan Are you currently working on this?
If not, I can do this, since I haven't found any implementations of this so far :)
@araffin Should I do a PR for this or would you prefer a separate repository for this?
Should I do a PR for this or would you prefer a separate repository for this?
I would prefer a separate repo but we would put a link to it in our doc.
I finished the first implementation of this here:
https://github.com/philippkiesling/stable-baselines3-contrib-maskable-recurrent-ppo.
So far I have only tested this on my custom Dict-Environment, but I will do more testing for other functionalities by the end of the week.
Hey @philippkiesling, I apologise for the late reply; I have been focusing on tuning my custom environment (for work) for a while now and did not have the flexibility to carry out the merging. That being said, I just started out in this field and am unlikely to have the proficiency to make this modification myself. Thank you for taking this up!
I will try to take a look at your first implementation and understand how this merger is going to work (merge between RPPO and MaskablePPO). Currently, my environment uses Box inputs with MaskablePPO.
Hi, I needed this for a paper myself, so no additional work for me :) I tried box inputs, and it should work, but please let me know if there are any problems
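(For reference, this is roughly the pattern MaskablePPO expects for a Box-observation env today; the environment logic below is made up purely for illustration and assumes sb3-contrib >= 2.0 with the gymnasium API.)

```python
# Minimal sketch of a custom env with Box observations and invalid-action masking,
# as consumed by MaskablePPO. The env dynamics and mask rule are illustrative only.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from sb3_contrib import MaskablePPO


class BoxObsMaskedEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(5)
        self._steps = 0

    def action_masks(self) -> np.ndarray:
        # MaskablePPO picks this up (via get_action_masks) to know which actions are valid.
        mask = np.ones(self.action_space.n, dtype=bool)
        mask[self._steps % self.action_space.n] = False  # one arbitrary invalid action
        return mask

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        terminated = self._steps >= 20
        return obs, 1.0, terminated, False, {}


model = MaskablePPO("MlpPolicy", BoxObsMaskedEnv(), verbose=0)
model.learn(5_000)
```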
Sure! I will try it out in the next few days :) Will let you know if it ran to completion!
Updated the newest RecurrentMaskablePPO at https://github.com/wdlctc/recurrent_maskable, based on sb3-contrib version 1.8.0, since the previous version doesn't work with the current version of sb3-contrib. Feel free to take it.
Updated the newest RecurrentMaskablePPO at https://github.com/wdlctc/recurrent_maskable, based on sb3-contrib version 1.8.0
Thanks for the update, do you mean it now works with current SB3 master version?
Yes, so far it works well with stable_baselines3 version 1.8.0 on my custom environment and on public test environments.
Both are from last year without updates ... could someone take over to make them available for the latest versions?
+1 would love to see an update for 2.3.* ❤️
I made it work with some simple fixes, but in my custom Env it won't learn anything:
```
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 19.6       |
|    ep_rew_mean          | 416        |
| time/                   |            |
|    fps                  | 539        |
|    iterations           | 7          |
|    time_elapsed         | 99         |
|    total_timesteps      | 53760      |
| train/                  |            |
|    approx_kl            | 43809744.0 |
|    clip_fraction        | 0.841      |
|    clip_range           | 0.2        |
|    entropy_loss         | -3.15      |
|    explained_variance   | -0.00137   |
|    learning_rate        | 0.0003     |
|    loss                 | 9.57e+04   |
|    n_updates            | 24         |
|    policy_gradient_loss | 0.423      |
|    value_loss           | 1.11e+05   |
----------------------------------------
```
This approx_kl is hilarious..
I can run my custom env with either Maskable or Recurrent and the values look normal, but as soon as they are combined it goes crazy. I double-checked the code base and it's simply a merge of both algorithms. I don't see any problem here, but I'm no expert.
I had my Env run for 10M steps and it learned nothing; episode length and mean reward stayed roughly the same.
I could pin it down to the following line inside of distributions.py:
HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)
If I lower this number to something like -1e5, I don't get 43M for approx_kl anymore but about 43k instead. So the same factor of reduction applied to this value also applies to approx_kl.
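For intuition, a toy application of SB3's approx_kl formula (not code from the merged repo) shows how a single action that was valid at rollout time but is masked to -1e8 at train time would dominate the estimate:

```python
# Illustrative only: applies SB3's approx_kl formula (ppo.py) to toy numbers to
# show why a -1e8 masked logit inflates the estimate by roughly that factor.
import torch as th

HUGE_NEG = -1e8  # the constant from sb3-contrib's MaskableCategorical

# Log-prob of the taken action at rollout time (a normal value)...
old_log_prob = th.tensor([-1.2])
# ...and at train time, where the same action is now masked out, so its
# log-softmax is approximately HUGE_NEG.
new_log_prob = th.tensor([HUGE_NEG])

# SB3: approx_kl_div = mean((exp(log_ratio) - 1) - log_ratio)
log_ratio = new_log_prob - old_log_prob
approx_kl = th.mean((th.exp(log_ratio) - 1) - log_ratio)
print(approx_kl.item())  # ~1e8: exp(log_ratio) is ~0, so the -log_ratio term dominates
```

Averaged over a batch where only some samples hit this case, the reported value scales down proportionally, which would match the observation that shrinking HUGE_NEG shrinks approx_kl by the same factor; whether rollout-time and train-time masks really disagree for the same actions in the merged implementation is only an assumption here.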
@araffin Any ideas on how to keep this value large enough to preserve the masking functionality without inflating approx_kl by the same factor?