
Agent stops exploration after a number of iterations

Open NMO13 opened this issue 4 years ago • 14 comments

I have been training the Connect4 agent for more than 100 iterations, but for the last 20 iterations it has not been getting any smarter. I always get:

 Arena.playGames |################################| (40/40) Eps Time: 10.063s | Total: 0:06:42 | ETA: 0:00:11
NEW/PREV WINS : 0 / 0 ; DRAWS : 40

However, when I play against it, I can win pretty easily. I investigated the learning process further and noticed that the action probabilities at the first level of the tree are always [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]. The reason is that the visit counts are e.g. [0, 0, 0, 66, 0, 0]. So as far as I can tell, the agent is not exploring anymore and always follows the same paths: no more exploration, just exploitation. Increasing cpuct doesn't seem to help; it just scales the exploration term of each action proportionally to its prior, so actions with a near-zero prior stay unvisited.
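For context: those probabilities are just the MCTS visit counts normalized with a temperature. A minimal sketch of that step (not the repo's exact getActionProb, just the idea):

import numpy as np

def visit_counts_to_policy(counts, temp=1.0):
    # Turn root visit counts Nsa[(s, a)] into a move policy.
    counts = np.asarray(counts, dtype=float)
    if temp == 0:
        probs = np.zeros_like(counts)
        probs[np.argmax(counts)] = 1.0      # greedy: all mass on the most-visited move
        return probs
    counts = counts ** (1.0 / temp)         # temperature sharpens or flattens the counts
    return counts / counts.sum()

# With counts like the ones above, everything collapses onto a single column:
print(visit_counts_to_policy([0, 0, 0, 66, 0, 0, 0]))  # -> [0. 0. 0. 1. 0. 0. 0.]

So once one move hoovers up all the visits, the training target itself becomes a one-hot vector.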

I think the implementation is missing the Dirichlet noise.

NMO13 avatar Apr 19 '20 19:04 NMO13

Hey @NMO13!

Can you paste your entire config from main? In particular, I'm curious about your temperature threshold and numMCTSSims.

mikhail avatar Apr 22 '20 03:04 mikhail

Hi @mikhail,

Of course. I think I mostly used the defaults:

args = dotdict({
    'numIters': 1000,
    'numEps': 25,              # Number of complete self-play games to simulate during a new iteration.
    'tempThreshold': 15,        # Move number after which the temperature drops from 1 to 0 (greedy self-play moves).
    'updateThreshold': 0.6,     # During arena playoff, new neural net will be accepted if threshold or more of games are won.
    'maxlenOfQueue': 200000,    # Number of game examples to train the neural networks.
    'numMCTSSims': 25,          # Number of MCTS simulations to run per move.
    'arenaCompare': 40,         # Number of games to play during arena play to determine if new net will be accepted.
    'cpuct': 1,

    'checkpoint': './temp/',
    'load_model': True,
    'load_folder_file': ('/media/nmo/3E26969E2696572D/Martin/Programmieren/Machine-Learning/reinforcement_learning/alpha_go_zero/temp','temp.pth.tar'),
    'numItersForTrainExamplesHistory': 20,

})

I increased numMCTSSims to 200 for a couple of iterations, but that took too long to train and it also didn't improve the training. I didn't touch the tempThreshold; it stayed at 15 the whole time. And as mentioned already, I increased cpuct to 10 for a couple of iterations, but that didn't help either, because "bad" states with a low prior probability stay low.

NMO13 avatar Apr 22 '20 10:04 NMO13

But now I have added Dirichlet noise, and voilà, it learns again.

NEW/PREV WINS : 0 / 0 ; DRAWS : 40
REJECTING NEW MODEL
------ITER 3------
NEW/PREV WINS : 0 / 0 ; DRAWS : 40
REJECTING NEW MODEL
------ITER 4------
...
NEW/PREV WINS : 4 / 3 ; DRAWS : 33
REJECTING NEW MODEL
------ITER 15------
...
NEW/PREV WINS : 17 / 3 ; DRAWS : 20
ACCEPTING NEW MODEL
------ITER 19------

The agent even got notably smarter in the early stages of the game. I will create a pull request tomorrow; maybe someone will be so kind as to review it.

NMO13 avatar Apr 22 '20 10:04 NMO13

Another thing worth checking (and this is similar to another thread) is your string representation of the state. If it is inaccurate (imagine the ad absurdum example of always returning the same string), then the learning process will cache the outcome for that state and ignore anything it learns afterwards. Double check that you have a true representation of the state.
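For example, for a NumPy-based board a faithful representation can be as simple as a byte dump of the whole array; the important part is that two different positions can never map to the same key. A minimal sketch, assuming the board is a fixed-shape numpy array (not necessarily the exact Connect4Game code):

import numpy as np

def stringRepresentation(board):
    # Key under which MCTS caches priors, Q-values and visit counts for a position.
    # It must encode the entire board: if two different positions ever produce the
    # same key, their statistics get merged and learning silently breaks.
    return board.tobytes()

# Two different positions must give two different keys:
a = np.zeros((6, 7), dtype=np.int8)
b = a.copy()
b[5, 3] = 1
assert stringRepresentation(a) != stringRepresentation(b)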

mikhail avatar Apr 22 '20 15:04 mikhail

I normally train with more self play episodes (100). I am currently training a connect4 model to experiment with solved games, and noticed that my episode time is much shorter (2s). Are you using a GPU? This would speed up training considerably. You can use Google Colab or Paperspace for this. Also what are your NNet.py params?

goshawk22 avatar Apr 23 '20 07:04 goshawk22

Connect 4 is a solved game (a first-player win with perfect play), so eventually the model should get to a point where it wins every time it plays first. If arenaCompare is set to 40 games, each model (new and previous) plays first in half of them, so 20 games will be won by each.

goshawk22 avatar Apr 23 '20 19:04 goshawk22

@goshawk22 can you please post your entire config from main? How big are your policy/value losses? For how many iterations did you train your model?

NMO13 avatar Apr 24 '20 23:04 NMO13

I normally train with more self play episodes (100). I am currently training a connect4 model to experiment with solved games, and noticed that my episode time is much shorter (2s). Are you using a GPU? This would speed up training considerably. You can use Google Colab or Paperspace for this. Also what are your NNet.py params?

Yes, I have an Nvidia GPU, but it's in a laptop, so probably not high performance. I forgot to mention that I implemented a PyTorch net for the Connect4 agent. It's pretty much the same architecture as the one for Othello, but maybe that makes a difference, who knows. I will train it again using the Keras implementation, with an episode count of 100 as you proposed, once with and once without my Dirichlet noise implementation. I will post the results when I am finished.

NMO13 avatar Apr 26 '20 22:04 NMO13

The code to add Dirichlet noise is pretty simple, and I agree it is necessary. In the search function of MCTS, below the line:

self.Ps[s], v = self.nnet.predict(canonicalBoard)

I added a boolean dir_noise and a float alpha, and I added the line:

if dir_noise: self.Ps[s] = (0.75 * self.Ps[s]) + (0.25 * np.random.dirichlet([alpha]) * len(self.Ps[s])))

I did this based on this line from the paper:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0, specifically P(s, a) = (1 − ε)p + εη, where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves.

I don't completely understand the mathematics of the Dirichlet distribution, but I chose an alpha of 1.4 for Connect 4 because there are approximately 7 legal moves instead of roughly 250 in Go. The one thing I am wondering about my code is that I am not sure when to apply it to the prior probabilities generated by the neural network for a board s. Currently I am applying it at all the searched board nodes, because otherwise only the first move would get the Dirichlet noise, due to the caching of the prior probabilities and policies.

jrbuhl93 avatar May 06 '20 19:05 jrbuhl93

I also added Dirichlet noise to my local branch, in a very similar way to yours. I trained one version of the Connect4 agent with the master branch code for 100 iterations, and at the moment I am training another version with Dirichlet noise for 100 iterations. When it's finished training, I will let them pit against each other and publish the results here. The training will take around 2-3 more days because I have a moderate laptop GPU. But I still think that the additional randomness is important, because otherwise it seems like the agent stops exploring very soon.

NMO13 avatar May 10 '20 12:05 NMO13

Ok, here are the results: agent1 has Dirichlet noise and was trained for 100 iterations; agent2 was trained for 100 iterations without Dirichlet noise. I let them pit against each other for 50 games. The result: (37, 8, 5), meaning agent1 won 37 times, agent2 won 8 times, and 5 games were draws. So I think adding noise is not a bad idea.

NMO13 avatar May 16 '20 11:05 NMO13


@jrbuhl93 I think it should be if dir_noise: self.Ps[s] = (0.75 * self.Ps[s]) + (0.25 * np.random.dirichlet([alpha] * len(self.Ps[s])))

There is a misplaced ) after [alpha]: the [alpha] list has to be repeated to the length of the policy vector inside the dirichlet call; otherwise you sample a one-element Dirichlet (which is always 1.0) and just scale it by the vector length.
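Spelled out as a standalone helper (just a sketch; dir_noise/alpha are the hypothetical parameters from above, and eps = 0.25 as in the paper):

import numpy as np

def add_dirichlet_noise(priors, alpha=1.4, eps=0.25):
    # Mix Dirichlet noise into a prior policy vector: (1 - eps) * p + eps * eta.
    noise = np.random.dirichlet([alpha] * len(priors))  # one noise component per action
    return (1 - eps) * np.asarray(priors) + eps * noise

p = np.array([0.05, 0.05, 0.05, 0.70, 0.05, 0.05, 0.05])
print(add_dirichlet_noise(p))  # every move now gets some exploration mass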

zenghsh3 avatar May 22 '20 05:05 zenghsh3

@jrbuhl93 how do you achieve that the noise is added to the root node only, as described in the paper?

evg-tyurin avatar May 22 '20 06:05 evg-tyurin

@jrbuhl93 how do you achieve that the noise is added to the root node only, as described in the paper?

If you look at my pull request, it does this. NMO13 (Martin) also has a different kind of implementation that only affects the root node. He is currently testing to see which one gives better results for Connect 4.
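One way to restrict the noise to the root (a sketch, not the exact PR code; dirichletAlpha is a hypothetical args entry): run a single search so the root priors are cached, mix the noise into that cache entry once per move, then run the remaining simulations.

import numpy as np

class MCTS:
    # Only the noise hook is shown; everything else is the unchanged MCTS class.

    def getActionProb(self, canonicalBoard, temp=1):
        s = self.game.stringRepresentation(canonicalBoard)

        # One search so that the root priors self.Ps[s] exist in the cache.
        self.search(canonicalBoard)

        # Mix Dirichlet noise into the cached root priors once per move,
        # then re-mask invalid moves and renormalize.
        valids = self.game.getValidMoves(canonicalBoard, 1)
        noise = np.random.dirichlet([self.args.dirichletAlpha] * len(self.Ps[s]))
        self.Ps[s] = (0.75 * self.Ps[s] + 0.25 * noise) * valids
        self.Ps[s] /= np.sum(self.Ps[s])

        # Every simulation after this point sees the noisy priors only at the root;
        # deeper nodes keep the raw network priors.
        for _ in range(self.args.numMCTSSims):
            self.search(canonicalBoard)

        # ...visit counts are turned into a policy exactly as in the original code...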

jrbuhl93 avatar May 22 '20 14:05 jrbuhl93