
Multiagent with different Action and State spaces

Open levifussell opened this issue 2 years ago • 3 comments

  • [x] I have marked all applicable categories:
    • [x] exception-raising bug
    • [ ] RL algorithm bug
    • [ ] documentation request (i.e. "X is missing from the documentation.")
    • [ ] new feature request
  • [x] I have visited the source website
  • [x] I have searched through the issue tracker for duplicates
  • [x] I have mentioned version numbers, operating system and environment, where applicable:

Tianshou v0.4.5

My current project requires two agents in a shared environment, each of which has a different state and action space. I've been experimenting with the Tianshou multi-agent setup to get this working, but everything points towards what I need not being supported.

The biggest issue seems to be maintaining two different array sizes inside a shared buffer, where Tianshou will by default store these as np.object arrays to avoid stacking the different sizes. I've dug into the Batch and buffer classes, but I'll admit I haven't fully unpicked what is going on. It seems to me this functionality isn't supported? Are there any recommended workarounds?
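
A minimal sketch of the shape mismatch (the observation sizes here are made up for illustration):

```python
import numpy as np

# Two agents with different observation sizes.
obs_agent_a = np.zeros(4, dtype=np.float32)  # e.g. a 4-dim state
obs_agent_b = np.zeros(7, dtype=np.float32)  # e.g. a 7-dim state

# Stacking them forces a dtype=object array because the shapes differ,
# which is roughly what ends up in a shared buffer.
stacked = np.array([obs_agent_a, obs_agent_b], dtype=object)
print(stacked.dtype)  # object
```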

Cheers.

levifussell avatar Mar 03 '22 23:03 levifussell

Maybe switch to something like:

observation_space = gym.spaces.Dict(...)

obs = {"player_a": np.array(...), "player_b": np.array(...)}  # one entry per player
# if only player_a is present, fill player_b with np.zeros_like(...)
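
For instance, a rough sketch of that approach (the player names and observation shapes are placeholders):

```python
import numpy as np
import gym

# One Dict observation space covering both players.
observation_space = gym.spaces.Dict({
    "player_a": gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32),
    "player_b": gym.spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float32),
})

def make_obs(obs_a=None, obs_b=None):
    """Build the full observation dict, zero-filling whichever player is absent."""
    return {
        "player_a": obs_a if obs_a is not None else np.zeros(4, dtype=np.float32),
        "player_b": obs_b if obs_b is not None else np.zeros(7, dtype=np.float32),
    }
```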

Trinkle23897 avatar Mar 04 '22 00:03 Trinkle23897

Huh, I didn't know about gym.spaces.Dict.

I'm guessing a Dict space can also be used for action_space. And processing both of these will require some manual work to split the dicts before passing them to the RL algorithms?

One workaround I did use was to pad the observations and actions with zeros. These shouldn't have any effect on PPO training, as the padded values carry no gradient...
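
Something along these lines (a minimal sketch; the target size is made up):

```python
import numpy as np

MAX_DIM = 7  # the larger of the two agents' observation sizes (illustrative)

def zero_pad(x, size=MAX_DIM):
    """Zero-pad an array so both agents share one shape in the buffer.
    The padded entries never receive gradient, so they should not
    affect PPO updates."""
    padded = np.zeros(size, dtype=np.float32)
    padded[:x.shape[0]] = x
    return padded

zero_pad(np.ones(4, dtype=np.float32))  # -> [1. 1. 1. 1. 0. 0. 0.]
```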

levifussell avatar Mar 04 '22 00:03 levifussell

Yep... Basically, what the MultiAgentPolicyManager (MAPM) does is split the whole observation into several parts and send each part to its own policy. For the action it's the reverse: concat at the end, then return.
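
In spirit, something like this (a simplified sketch of the split/concat idea, not the actual MultiAgentPolicyManager code):

```python
def multi_agent_forward(policies, joint_obs):
    """Split the joint observation per agent, run each policy on its own
    slice, and combine the resulting actions into one joint action."""
    actions = {}
    for agent_id, policy in policies.items():
        # `policy` here is just a callable, not Tianshou's actual policy API.
        actions[agent_id] = policy(joint_obs[agent_id])
    return actions  # joint action keyed by agent
```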

Trinkle23897 avatar Mar 04 '22 00:03 Trinkle23897