
I have some questions about DeepMind's pysc2 baseline agents.

https://deepmind.com/blog/deepmind-and-blizzard-open-starcraft-ii-ai-research-environment/

I'm trying to implement the baseline agent described in the pysc2 paper.

My GitHub: https://github.com/chris-chris/pysc2-examples

Question 1. How can I implement 'the policy in an auto-regressive manner, utilizing the chain rule'?

  1. I got the idea of masking unavailable actions out of the policy network output.
  2. But I don't get the idea of 'the policy in an auto-regressive manner, utilizing the chain rule'. If I choose the first action, how can I then get the second and third action parameters in an auto-regressive manner?

Question 2. How can I define the 'action space' for A3C on pysc2? (Similar to Q1.) I'm only really familiar with Discrete action spaces.

  1. I got the concept of computing the spatial action policy using a 1x1 convolution layer for dimension reduction.
  2. But what would the action space look like? It shouldn't just be a tf.int32. Maybe it could be a MultiDiscrete action space: [move_screen (categorical int), [[0] (categorical int), (x1 (int), y1 (int))]]
  3. If it is a MultiDiscrete action space, then how can I define the policy loss function?

If you know of any reference code on GitHub, please let me know. Pointers to relevant papers and articles would also be welcome.

chris-chris avatar Oct 07 '17 09:10 chris-chris

While my team and I are still figuring this out, you may want to ensure that you fully understand how autoregressive policy outputs work.

Discrete Sequential Prediction of Continuous Actions for Deep RL is a really good paper for understanding how these types of policies are implemented. While I am not too familiar with MultiDiscrete action spaces in TensorFlow, it's important to note that implementations of this type of policy can easily become infeasible in the number of parameters.

It would be great to get insight from an engineer on the DeepMind team about this issue (it seems like lots of people are having it), and hopefully we can keep this issue thread open for more discussion as we continue to work on this topic.

bhairavmehta95 avatar Oct 08 '17 18:10 bhairavmehta95

I am also working on implementing the baseline agent from the release paper. I think for the mini-games it's enough to set all arguments except screen and minimap to 0, and the paper is clear about how to handle those two.

pekaalto avatar Oct 09 '17 14:10 pekaalto

I applied ACKTR (A3C) in my GitHub repo: https://github.com/chris-chris/pysc2-examples

It is learning, but I think it needs more improvements.

I trained a separate pi head for each output: base_action (524), sub1_action (2), sub2_action (3~10), sub3_action (500), and x1, y1, x2, y2 (64 each), and multiplied the loss functions.
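To be precise about 'multiplied': multiplying the per-head probabilities is the same as summing the per-head log-probabilities before applying the usual policy-gradient loss. A minimal TensorFlow sketch (the head tensors are hypothetical, not the exact code in my repo):

```python
import tensorflow as tf

def policy_loss(log_prob_heads, advantages):
    """log_prob_heads: list of (batch,) log-probs of the chosen sub-actions,
    one per head (base_action, sub-actions, coordinates). advantages: (batch,)."""
    # Multiplying per-head probabilities == summing their log-probabilities.
    joint_log_prob = tf.add_n(log_prob_heads)
    # Standard policy-gradient loss (advantages treated as constants).
    return -tf.reduce_mean(joint_log_prob * tf.stop_gradient(advantages))
```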

Your feedback is welcomed and appreciated.

chris-chris avatar Oct 09 '17 16:10 chris-chris

@chris-chris I think I know the answer to your Q1. I will refer to the equation numbers in the paper. You simply substitute the pi_theta(a|s) expressions in (1) with the identity in (2). Because all of these are inside a log function, the resulting expression will be a sum, not a product.

Computing the gradient of a sum is much easier than that of a product. For the policy gradient part, you start with grad(log(pi_theta(a|s))) and end up with grad(log(pi_theta(a^0|s))) + grad(log(pi_theta(a^1|s))) + ... using simple calculus rules.
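Written out explicitly (notation assumed from the paper; superscripts index sub-actions, and in the auto-regressive case each sub-action is also conditioned on the previously chosen ones):

```latex
\pi_\theta(a \mid s) = \prod_{l=0}^{L} \pi_\theta\!\left(a^{l} \mid a^{<l}, s\right)
\quad\Longrightarrow\quad
\nabla_\theta \log \pi_\theta(a \mid s) = \sum_{l=0}^{L} \nabla_\theta \log \pi_\theta\!\left(a^{l} \mid a^{<l}, s\right)
```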

If I've got this wrong someone please correct me.

I have two questions myself.

I am interested in how the authors of the paper handled the spatial parameters of actions on output in their baseline agents. The paper specifically states that there is only a single channel of 1x1 convolution. How then do you select the second point, which is necessary for performing the select_rect(p1, p2) action? (The logical solution seems to me to be simply using 2 spatial output channels, but then what about the statement in the paper?)

I would also like to ask you to elaborate on this sentence from your paper: 'We embed all feature layers containing categorical values into a continuous space, which is equivalent to using a one-hot encoding in the channel dimension followed by a 1 × 1 convolution.'

I kind of understand this, but it doesn't seem equivalent to me. I agree that in the first layer the categorical inputs get encoded into some real values anyway, but they can be encoded by several neurons in a very different manner, depending on the weights. So how am I to understand this concept?
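To make my question concrete, here is how I read the claimed equivalence, as a minimal numpy sketch (all sizes made up): a one-hot encoding over the channel dimension followed by a 1x1 convolution picks out one row of the conv kernel per pixel, i.e. an embedding lookup.

```python
import numpy as np

H, W, C, D = 64, 64, 5, 8                      # height, width, #categories, embedding dim
layer = np.random.randint(0, C, size=(H, W))   # a categorical feature layer
kernel = np.random.randn(C, D)                 # 1x1 conv kernel, viewed as (C, D)

one_hot = np.eye(C)[layer]    # (H, W, C): one-hot in the channel dimension
conv_out = one_hot @ kernel   # 1x1 conv == per-pixel matmul over channels

embed_out = kernel[layer]     # (H, W, D): direct embedding lookup

assert np.allclose(conv_out, embed_out)
```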

I'd be very grateful for your answers.

avolny avatar Oct 10 '17 15:10 avolny

@chris-chris @avolny They only state that each spatial output is a 1x1 convolution with a single output channel. In my interpretation, there may be multiple spatial and non-spatial outputs.

For example, one way to implement it could be to use one output layer (fully-connected or convolutional, depending on the argument type) for each of the argument types listed in https://github.com/deepmind/pysc2/blob/master/pysc2/lib/actions.py#L200, as well as one for the function identifier. During sampling, one would then first sample the function id from the predicted function identifier distribution, and then sample only from the argument-type distributions that this function's signature actually uses.

So you would have a single-channel 1x1 conv output for each of the types 'screen', 'screen2', 'minimap', and similarly one output per non-spatial argument type. When computing the log-probs and entropies, you can mask out the arguments which are not used for a specific action sample (see the sketch below).
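A minimal sketch of this sampling scheme (fn_probs and arg_probs are hypothetical network outputs, not part of any real API, and 64x64 resolution is assumed):

```python
import numpy as np
from pysc2.lib import actions

def sample_action(fn_probs, arg_probs, available_actions, resolution=64):
    """fn_probs: (num_functions,) distribution over function identifiers.
    arg_probs: dict mapping argument-type name to a flat distribution.
    available_actions: iterable of currently available function ids."""
    # Mask out unavailable function ids and renormalize.
    mask = np.zeros_like(fn_probs)
    mask[list(available_actions)] = 1.0
    masked = fn_probs * mask
    masked /= masked.sum()
    fn_id = np.random.choice(len(masked), p=masked)

    # Sample only the argument types this function actually uses.
    args = []
    for arg_type in actions.FUNCTIONS[fn_id].args:
        probs = arg_probs[arg_type.name]
        sample = np.random.choice(len(probs), p=probs)
        if arg_type.name in ('screen', 'screen2', 'minimap'):
            # Spatial argument: unflatten the sampled pixel index
            # (index = y * resolution + x) into [x, y].
            args.append([sample % resolution, sample // resolution])
        else:
            args.append([sample])
    return actions.FunctionCall(fn_id, args)
```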

Referring to the paper, section 4.2:

In most of our experiments we found it sufficient to model sub-actions independently, however, we also explored with auto-regressive policies where the ordering was chosen such that we first decide for the function identifier, then all categorical arguments and finally, if relevant, pixel coordinates.

The implementation I described above should correspond to "modelling sub-actions independently".

@avolny Also, you could look at this issue for the input feature layers: https://github.com/deepmind/pysc2/issues/116.

simonmeister avatar Dec 31 '17 19:12 simonmeister

@avolny By the way, you can have a look at https://github.com/simonmeister/pysc2-rl-agents/blob/master/rl/agents/a2c/agent.py for an example implementation. It predicts the complete action space with all argument types; it works fine for MoveToBeacon, and I will test other mini-games soon.

simonmeister avatar Jan 05 '18 16:01 simonmeister

@chris-chris @simonmeister In the original paper, the network architecture implements a spatial policy (64x64?). How do we implement this kind of output layer? Just flatten it into a 64x64 fully-connected linear layer?

JIElite avatar Feb 22 '18 17:02 JIElite

@JIElite In the FullyConv agent, the spatial policy should be taken from the convolutional stream, with a 1-channel convolution as output and no further fully-connected layers. This is then flattened and interpreted as a probability distribution over all the pixels (see the sketch below).
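A minimal TF 1.x-style sketch of such an output head (names and the 64x64 resolution are illustrative):

```python
import tensorflow as tf

def spatial_policy(conv_stream, resolution=64):
    """conv_stream: (batch, resolution, resolution, channels) from the conv trunk."""
    # 1-channel 1x1 convolution over the convolutional stream.
    logits = tf.layers.conv2d(conv_stream, filters=1, kernel_size=1)
    # Flatten the spatial dimensions and normalize over all pixels.
    logits = tf.reshape(logits, [-1, resolution * resolution])
    return tf.nn.softmax(logits)  # probability distribution over pixels
```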

simonmeister avatar Feb 22 '18 20:02 simonmeister