
Troubleshooting A3C agent

Open avolny opened this issue 7 years ago • 3 comments

Hello, as part of my master's thesis I have been trying to reproduce the results from the DeepMind paper. I have implemented the A3C algorithm and am currently testing it on the MoveToBeacon minigame. However, I am having trouble learning a good policy: none of my runs has exceeded an average reward of 2. My hardware is quite limiting; I am not able to run more than 3600 simulation steps per minute (including the loss function minimisation, with multiple simulations running at the same time; 1 agent step = 8 actual game steps). So it takes nearly 5 hours of training to reach 1 million simulation steps.
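For reference, a minimal MoveToBeacon setup with step_mul=8 might look like the sketch below; the exact constructor arguments depend on the pysc2 version, so this is illustrative rather than my exact code.

```python
# Sketch only: argument names assume the pysc2 2.x API and may differ in older releases.
from pysc2.env import sc2_env
from pysc2.lib import features

env = sc2_env.SC2Env(
    map_name="MoveToBeacon",
    players=[sc2_env.Agent(sc2_env.Race.terran)],
    agent_interface_format=features.AgentInterfaceFormat(
        feature_dimensions=features.Dimensions(screen=84, minimap=64)),
    step_mul=8,        # one agent step = 8 game steps, as described above
    visualize=False)

timesteps = env.reset()
# timesteps[0].observation / .reward / .last() drive the agent loop.
```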

I am not sure what I am doing wrong. There may be a mistake somewhere in my implementation (it should be in accordance with the original A3C paper). I may have set the wrong hyperparameters. Or the learning may simply be so slow that it looks as if no learning happens (I usually don't let it run much past 1 million steps). But I can't tell with certainty which it is until I successfully train an agent.
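For context, the usual A3C objective combines a policy-gradient term, a scaled value loss, and an entropy bonus. A rough sketch of how the terms are combined (PyTorch is used only for illustration; the coefficient names are mine, not from the paper):

```python
# Sketch only; value_coef / entropy_coef are illustrative names and defaults.
import torch.nn.functional as F

def a3c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=1e-3):
    # log_probs: log pi(a_t|s_t) of the taken actions, values: V(s_t),
    # returns: n-step returns, entropies: policy entropy at each step.
    advantages = (returns - values).detach()       # no gradient through the baseline here
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy_bonus = entropies.mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```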

In the paper, you mention some of the hyperparameter settings (learning rate sampled from uniform(1e-3, 1e-5), entropy loss coefficient 1e-3, BPTT unrolled for 40 steps, 64 threads).

However, you don't mention some of the others. Specifically, from what distribution did you sample the value gradient coefficient and the T_max of A3C (meaning the maximum number of steps used for computing the n-step reward)? How important is the number of parallel agents during training? To me, there seems to be little difference between training with 10 agents and with 32 agents.
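To make the T_max question concrete: the n-step return is usually computed by a backward recursion over a rollout of at most T_max steps, bootstrapping from the critic's value of the last state. A rough sketch (the gamma value is just a placeholder):

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # rewards: rollout of at most T_max steps; bootstrap_value: V(s_T) from the
    # critic if the episode did not terminate, else 0.
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```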

Also, what would you recommend for pinpointing where the problem in my implementation is? How fast should a correct implementation be expected to find a reasonable policy on the MoveToBeacon minigame? In another issue, someone said that after a couple of tens of thousands it should already be good.

Any insight greatly appreciated!

avolny avatar Oct 19 '17 14:10 avolny

I have to say that parallelism is crucial in A3C.

pkuso avatar Oct 20 '17 03:10 pkuso

@Adam-Volny wild guess, but one thing you might want to check is issue #103. When I used the coordinates in the wrong order for the action, I also got an average of ~2 on the beacon (still slightly better than random) no matter what.

However, when I fixed the coordinate-order issue it worked very quickly. In a much simplified setting the beacon was "solved" in ~6000 episodes. See more explanation here: https://github.com/pekaalto/sc2atari
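For anyone else hitting this: as far as I can tell, pysc2 expects the point argument as [x, y], while np.unravel_index on a screen-shaped policy gives (y, x). A rough sketch of the conversion (the function name is illustrative):

```python
import numpy as np
from pysc2.lib import actions

_MOVE_SCREEN = actions.FUNCTIONS.Move_screen.id
_NOT_QUEUED = [0]

def move_to_argmax(spatial_policy):
    # spatial_policy is indexed [y, x] like the feature layers, but the
    # pysc2 point argument is [x, y], so swap before building the action.
    y, x = np.unravel_index(np.argmax(spatial_policy), spatial_policy.shape)
    return actions.FunctionCall(_MOVE_SCREEN, [_NOT_QUEUED, [x, y]])
```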

About the number of environments: I suspect there is not much difference between ~10 and 32 environments in terms of training speed per total number of episodes (summed over all workers). See figure 3 in https://arxiv.org/pdf/1602.01783.pdf, where data efficiency is compared against the number of workers. While parallelism is important, there seem to be super-linear gains only with fewer than 10 workers (of course, SC2 might behave differently).

I would also be interested to know more specifically what hyperparameters DeepMind used.

pekaalto avatar Oct 23 '17 12:10 pekaalto

In our A2C re-implementation (https://github.com/simonmeister/pysc2-rl-agents), we are also able to train MoveToBeacon in about 4000 to 8000 episodes for most runs (using 32 parallel environments), with mostly complete inputs and the full action space. Thus, it seems that a correct implementation should be able to learn MoveToBeacon rather quickly.
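For illustration, the synchronous stepping over parallel environments is roughly as follows; this is a simplified sketch, not the actual code from the repo, and the `policy` interface is made up.

```python
def collect_rollout(envs, last_obs, policy, n_steps):
    # Step all environments in lockstep; `policy` maps a batch of observations
    # to a batch of actions (illustrative interface). Episode-end handling is omitted.
    obs_batch, act_batch, rew_batch = [], [], []
    obs = last_obs
    for _ in range(n_steps):
        acts = policy(obs)                                     # one batched forward pass
        steps = [env.step([a])[0] for env, a in zip(envs, acts)]
        obs_batch.append(obs)
        act_batch.append(acts)
        rew_batch.append([ts.reward for ts in steps])
        obs = [ts.observation for ts in steps]
    return obs_batch, act_batch, rew_batch, obs                # carry obs into the next rollout
```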

+1 for the hyperparameters request. It would be great to know the value loss coefficient and the discount factor.

simonmeister avatar Jan 05 '18 17:01 simonmeister