stable-baselines3-contrib
[Feature Request] MaskableRecurrentPPO
Motivation MaskablePPO is great for large discrete action spaces with many invalid actions at each step, while RecurrentPPO gives the agent a memory of previous observations and actions, which improves its decision-making. Right now, we have to choose between these two algorithms and cannot have the features of both, which would greatly improve agent training when both action masking and sequence processing are helpful.
Feature MaskableRecurrentPPO - an algorithm that combines MaskablePPO and RecurrentPPO, or action masking integrated into PPO and RecurrentPPO.
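For illustration, a hypothetical usage sketch of the requested algorithm; the `MaskableRecurrentPPO` class does not exist yet, and the combined `predict` signature shown here (LSTM state plus `action_masks`) is an assumption that simply mirrors the existing MaskablePPO and RecurrentPPO APIs:

```python
# Hypothetical API sketch: MaskableRecurrentPPO is the requested feature, not an
# existing sb3-contrib class. It is assumed to take an LSTM policy like RecurrentPPO
# and to read action masks from the env like MaskablePPO.
import numpy as np
from sb3_contrib.common.maskable.utils import get_action_masks

env = ...  # a VecEnv whose underlying env implements action_masks()
model = MaskableRecurrentPPO("MlpLstmPolicy", env, verbose=1)  # assumed class name
model.learn(total_timesteps=100_000)

obs = env.reset()
lstm_states = None
episode_starts = np.ones((env.num_envs,), dtype=bool)
for _ in range(1_000):
    action, lstm_states = model.predict(
        obs,
        state=lstm_states,
        episode_start=episode_starts,
        action_masks=get_action_masks(env),  # assumed: merged predict signature
        deterministic=True,
    )
    obs, rewards, dones, infos = env.step(action)
    episode_starts = dones
```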
Duplicate of https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/76
@dylanprins I would be happy to share the link in the doc if you could open source your implementation ;)
@dylanprins +1 If anyone has a solution for it, that would be really great :)
I would love to try this implementation if it is available! MaskablePPO is the only algorithm that makes use of the masking technique :+1:
@rllyryan Are you currently working on this?
If not, I can do this, since I haven't found any implementations of this so far :)
@araffin Should I do a PR for this or would you prefer a separate repository for this?
Should I do a PR for this or would you prefer a separate repository for this?
I would prefer a separate repo but we would put a link to it in our doc.
I finished the first implementation of this here:
https://github.com/philippkiesling/stable-baselines3-contrib-maskable-recurrent-ppo.
So far I have only tested this on my custom Dict-Environment, but I will do more testing for other functionalities by the end of the week.
Hey @philippkiesling, I apologise for the late reply; I have been focusing on tuning my custom environment (for work) for a while now and did not have the flexibility to carry out the merging. That being said, I just started out in this field and am unlikely to have the proficiency to make this modification myself. Thank you for taking this up!
I will try to take a look at your first implementation and understand how this merger is going to work (merge between RPPO and MaskablePPO). Currently, my environment uses Box inputs with MaskablePPO.
Hi, I needed this for a paper myself, so no additional work for me :) I tried box inputs, and it should work, but please let me know if there are any problems
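(For reference, this is roughly the pattern MaskablePPO expects for a Box-observation env today; the environment logic below is made up purely for illustration and assumes sb3-contrib >= 2.0 with the gymnasium API.)

```python
# Minimal sketch of a custom env with Box observations and invalid-action masking,
# as consumed by MaskablePPO. The env dynamics and mask rule are illustrative only.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from sb3_contrib import MaskablePPO


class BoxObsMaskedEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(5)
        self._steps = 0

    def action_masks(self) -> np.ndarray:
        # MaskablePPO picks this up (via get_action_masks) to know which actions are valid.
        mask = np.ones(self.action_space.n, dtype=bool)
        mask[self._steps % self.action_space.n] = False  # one arbitrary invalid action
        return mask

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        terminated = self._steps >= 20
        return obs, 1.0, terminated, False, {}


model = MaskablePPO("MlpPolicy", BoxObsMaskedEnv(), verbose=0)
model.learn(5_000)
```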
Sure! I will try it out in the next few days :) Will let you know if it ran to completion!
Updated the newest RecurrentMaskablePPO at https://github.com/wdlctc/recurrent_maskable, based on sb3-contrib version 1.8.0, since the previous version doesn't work with the current version of sb3-contrib. Feel free to take it.
Updated the newest RecurrentMaskablePPO at https://github.com/wdlctc/recurrent_maskable, based on sb3-contrib version 1.8.0
Thanks for the update, do you mean it now works with current SB3 master version?
Yes, so far it works well with stable_baselines3 version 1.8.0 on my custom environment and on public test environments.
Both are from last year without updates ... could someone take over to make them available for the latest versions?
+1 would love to see an update for 2.3.* ❤️
I made it work with some simple fixes, but in my custom Env it won't learn anything:
```
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 19.6       |
|    ep_rew_mean          | 416        |
| time/                   |            |
|    fps                  | 539        |
|    iterations           | 7          |
|    time_elapsed         | 99         |
|    total_timesteps      | 53760      |
| train/                  |            |
|    approx_kl            | 43809744.0 |
|    clip_fraction        | 0.841      |
|    clip_range           | 0.2        |
|    entropy_loss         | -3.15      |
|    explained_variance   | -0.00137   |
|    learning_rate        | 0.0003     |
|    loss                 | 9.57e+04   |
|    n_updates            | 24         |
|    policy_gradient_loss | 0.423      |
|    value_loss           | 1.11e+05   |
----------------------------------------
```
This approx_kl is hilarious..
I can run my custom env with either Maskable or Recurrent and the values look normal, but as soon as they are combined it goes crazy. I double-checked the code base and it's simply a merge of both algorithms. I don't see any problem here, but I'm no expert.
I had my Env run for 10M steps and it learned nothing; episode length and mean reward stayed roughly the same.
I could pin it down to the following line inside of distributions.py:
HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)
If I lower this number to something like -1e5, I don't get 43M for approx_kl anymore but about 43k instead. So the same factor of reduction applied to this value also applies to approx_kl.
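For intuition, a toy application of SB3's approx_kl formula (not code from the merged repo) shows how a single action that was valid at rollout time but is masked to -1e8 at train time would dominate the estimate:

```python
# Illustrative only: applies SB3's approx_kl formula (ppo.py) to toy numbers to
# show why a -1e8 masked logit inflates the estimate by roughly that factor.
import torch as th

HUGE_NEG = -1e8  # the constant from sb3-contrib's MaskableCategorical

# Log-prob of the taken action at rollout time (a normal value)...
old_log_prob = th.tensor([-1.2])
# ...and at train time, where the same action is now masked out, so its
# log-softmax is approximately HUGE_NEG.
new_log_prob = th.tensor([HUGE_NEG])

# SB3: approx_kl_div = mean((exp(log_ratio) - 1) - log_ratio)
log_ratio = new_log_prob - old_log_prob
approx_kl = th.mean((th.exp(log_ratio) - 1) - log_ratio)
print(approx_kl.item())  # ~1e8: exp(log_ratio) is ~0, so the -log_ratio term dominates
```

Averaged over a batch where only some samples hit this case, the reported value scales down proportionally, which would match the observation that shrinking HUGE_NEG shrinks approx_kl by the same factor; whether rollout-time and train-time masks really disagree for the same actions in the merged implementation is only an assumption here.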
@araffin Any ideas on how to keep this value large enough to preserve the masking functionality without inflating approx_kl by the same factor?