stable-baselines3
[Question] Multi Output Policy Support?
Question
Are multi-output policies supported yet? I see that dictionary observations are supported per the docs, but I do not see anything about multi-output policies.
Additional context
I want to make a wrapper around PySC2 now that dictionary observations are supported, but multi-output policy support is still required.
Checklist
- [x] I have read the documentation (required)
- [x] I have checked that there is no similar issue in the repo (required)
I think this feature would complement dictionary observations nicely. In the past we talked with @araffin about this, and the biggest issues are 1) what the correct implementation is and 2) what to do about support for off-policy algorithms (a very different implementation). I think A2C and PPO could support multiple, independent action spaces, and this should work well.
@araffin Comments? Should this be a contrib thing if DQN/SAC/TD3 implementation is not trivial or doable? At least on A2C/PPO side, independent action spaces is a common way to approach this.
> I want to make a wrapper around PySC2 now that dictionary observations are supported, but multi-output policy support is still required.

What type of multi-output policy is required (discrete, continuous, or other)?
> @araffin Comments? Should this be a contrib thing if DQN/SAC/TD3 implementation is not trivial or doable? At least on A2C/PPO side, independent action spaces is a common way to approach this.

I don't have many more comments than in https://github.com/DLR-RM/stable-baselines3/issues/349#issuecomment-800198204
> At least on A2C/PPO side, independent action spaces is a common way to approach this.

Ah, do you have some reference for that?
> Ah, do you have some reference for that?

Not a solid one right now, but at least this paper suggests starting with independent spaces before investigating whether adding dependencies would help. The latter would be very task-specific and hard to support in SB3, while independent spaces would be comparatively easy to implement.
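For readers wondering what "independent" means in practice for A2C/PPO: if each action component has its own output head and the heads are assumed independent, the joint log-probability used in the policy-gradient loss is just the sum of the per-head log-probabilities. Below is a minimal, library-free sketch of that idea. The function names and the example numbers are hypothetical and are not taken from SB3.

```python
import math

def categorical_log_prob(probs, action):
    """Log-probability of a discrete action under a categorical head."""
    return math.log(probs[action])

def gaussian_log_prob(mean, std, value):
    """Log-density of a scalar under a univariate Gaussian head."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

def joint_log_prob(head_log_probs):
    """Independent heads factorise, so the joint log-probability is a sum."""
    return sum(head_log_probs)

# Hypothetical action: one discrete choice plus an (x, y) target.
function_probs = [0.7, 0.2, 0.1]
log_p = joint_log_prob([
    categorical_log_prob(function_probs, 1),            # discrete head
    gaussian_log_prob(mean=0.0, std=1.0, value=0.5),    # x head
    gaussian_log_prob(mean=0.0, std=1.0, value=-0.5),   # y head
])
```

With this factorisation the usual PPO probability ratio would be `exp(new_log_p - old_log_p)` on the summed value, and nothing else in the loss needs to know how many heads there are. Modelling dependencies between components would require changing the heads themselves, which is the task-specific part.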
> What type of multi-output policy is required (discrete, continuous, or other)?

The PySC2 docs say it's a discrete space and a box (for the x, y of a move).
Now that I think about this, this can be done with a MultiDiscrete output space with PPO.
But this feature would be really awesome!
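A rough sketch of the MultiDiscrete workaround mentioned above, assuming the continuous (x, y) target can be discretised onto a grid: the action becomes a vector `[function_id, x_bin, y_bin]`, and an environment wrapper decodes it back before calling the game. The names `MoveAction`, `multidiscrete_sizes`, `decode` and `encode`, the `[0, 1]` coordinate range, and the grid size are all hypothetical and not part of PySC2 or SB3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveAction:
    function_id: int  # which discrete function to call
    x: float          # target x, assumed normalised to [0, 1]
    y: float          # target y, assumed normalised to [0, 1]

def multidiscrete_sizes(n_functions, grid):
    """Number of choices per sub-action.

    This list could be passed as `nvec` to gymnasium.spaces.MultiDiscrete.
    """
    return [n_functions, grid, grid]

def decode(action, grid):
    """Map a sample [function_id, x_bin, y_bin] to the centre of each bin."""
    function_id, x_bin, y_bin = (int(a) for a in action)
    return MoveAction(function_id, (x_bin + 0.5) / grid, (y_bin + 0.5) / grid)

def encode(move, grid):
    """Inverse of decode: map coordinates in [0, 1] back to bin indices."""
    x_bin = min(int(move.x * grid), grid - 1)  # clamp so x == 1.0 stays in range
    y_bin = min(int(move.y * grid), grid - 1)
    return [move.function_id, x_bin, y_bin]

# Example with 3 hypothetical functions and an 8x8 target grid.
sizes = multidiscrete_sizes(n_functions=3, grid=8)
move = decode([2, 0, 7], grid=8)
```

The trade-off is resolution: a finer grid gives more precise targets but enlarges the x and y heads, and the sub-actions are still sampled independently of each other.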
It seems that @adysonmaia implemented PPO with dict action space support here: https://github.com/adysonmaia/sb3-plus/blob/main/sb3_plus/mimo_ppo/ppo.py#L24
> It seems that @adysonmaia implemented PPO with dict action space support here: https://github.com/adysonmaia/sb3-plus/blob/main/sb3_plus/mimo_ppo/ppo.py#L24

Hi, I just started an implementation of PPO supporting dict action space for independent actions. At the moment, there isn't any documentation or validation tests yet. However, an "official" support of this feature in either SB3 or SB3-Contrib projects would be really interesting.
@adysonmaia are you planning on adding this feature to sb3-contrib or publishing sb3-plus so it can be installed with pip? I am very interested in this, so please tell me whether it could be soon or not. Thanks in advance.
Hi @EloyAnguiano, I intend to push the sb3-plus project as a pip repository when its code is more stable and tested. For now, it's possible to install it via pip using the GitHub url. For example:
pip install git+https://github.com/adysonmaia/sb3-plus#egg=sb3-plus