
dirichlet noise added to prior probabilities during self play

Open jrbuhl93 opened this issue 4 years ago • 12 comments

I don't completely understand the mathematics of the Dirichlet distribution, but I arbitrarily chose an alpha of 0.6 as the default value for dirichletAlpha in main.py because that seems approximately right for Othello. The one thing I am unsure about is when to apply the noise to the prior probabilities generated by the neural network for a board s. Currently I am applying it at every node visited during the search, because otherwise only the first move would get the Dirichlet noise due to the caching of the prior probabilities and policies.

But the paper seems to imply it should only be applied at the root node:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0, specifically P(s, a) = (1 − ε)p_a + εη_a, where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves.
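In code, that mixing step corresponds to something like the following sketch (purely illustrative, not code from this repo; eps and alpha are the paper's values, and alpha in particular would need retuning per game):

    import numpy as np

    eps = 0.25     # mixing weight ε from the paper
    alpha = 0.03   # Dir(0.03) from the paper; a game-dependent value in practice

    p = np.array([0.50, 0.30, 0.15, 0.05])          # example prior from the network
    noise = np.random.dirichlet([alpha] * len(p))   # one Dirichlet sample η
    p = (1 - eps) * p + eps * noise                 # applied only at the root node s0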

Thoughts and feedback are appreciated.

jrbuhl93 avatar May 06 '20 20:05 jrbuhl93

Thanks for sharing your code! Some remarks:

  • The paper states: "Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node..." So the noise should only be added at the root node.
  • You should only add noise to the valid moves, so something more like np.random.dirichlet([self.args.dirichletAlpha] * np.count_nonzero(valids)) (see the sketch after this list).
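One way to implement that restriction, only perturbing the entries for legal moves (a sketch based on the line above; add_noise_to_valid_moves is just an illustrative name, and the 0.25 epsilon follows the paper):

    import numpy as np

    def add_noise_to_valid_moves(p, valids, alpha, eps=0.25):
        """Return a copy of the prior p with Dirichlet noise mixed into the
        entries for valid moves only; invalid entries are left untouched."""
        p = np.copy(p)
        valid_idx = np.nonzero(valids)[0]
        noise = np.random.dirichlet([alpha] * len(valid_idx))
        p[valid_idx] = (1 - eps) * p[valid_idx] + eps * noise
        return p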

NMO13 avatar May 10 '20 12:05 NMO13

I made edits to my code to add in your remarks. Let me know if you believe it is behaving properly now.

jrbuhl93 avatar May 10 '20 14:05 jrbuhl93

I was able to get a Connect4 agent playing perfectly, as far as I have explored. It wins 100% of its games moving first in the evaluation loop. It also does not make any moves that would result in a loss or a draw when I play against it with the aid of https://connect4.gamesolver.org/. I used the updated Dirichlet noise algorithm I made with NMO13's input.

Other changes to achieve this:

  • I added code that randomly flips the board during self-play, as the AlphaGo Zero paper describes.
  • I changed the neural network architecture to the stack described in the AlphaGo Zero paper, but used only 6 residual blocks instead of 19 or 39 to save on computation.
  • I added a function that converts the input to the neural network into a 6x7x2 stack (see the sketch below). The first 6x7 plane holds the current player's pieces, denoted by ones, and the second holds the other player's pieces, also denoted by ones. I did this because I believe asking the network to interpret both 1s and -1s in a single plane may inhibit its ability to process a position. This matches the AlphaZero paper having separate input planes for each player's pieces, and it will also scale to games with multiple types of pieces, which I plan to implement next.
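A minimal sketch of what that conversion might look like (assuming the usual alpha-zero-general convention where the canonical 6x7 Connect4 board holds +1 for the current player and -1 for the other player; to_two_planes is a made-up name for illustration):

    import numpy as np

    def to_two_planes(canonical_board):
        """Convert a 6x7 board of {+1, 0, -1} into a 6x7x2 stack of 0/1 planes."""
        current = (canonical_board == 1).astype(np.float32)   # current player's pieces
        other = (canonical_board == -1).astype(np.float32)    # other player's pieces
        return np.stack([current, other], axis=-1)            # shape (6, 7, 2)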

My config was as follows:

    'numIters': 30,
    'numEps': 100,
    'tempThreshold': 10,
    'updateThreshold': 0.55,
    'maxlenOfQueue': 200000,
    'numMCTSSims': 150,
    'arenaCompare': 6,
    'cpuct': 1.25,
    'dirichletAlpha': 1.4,

I don't know the exact number of iterations, because I had to restart the code due to memory leaks in my current implementation of self-play parallelization, but it was probably ~75.

jrbuhl93 avatar May 14 '20 00:05 jrbuhl93

I believe I spoke too soon. The second player has begun playing alternative moves, allowing the training to continue for new types of positions.

jrbuhl93 avatar May 14 '20 16:05 jrbuhl93

I added some comments to the code.

Other changes to achieve this: I added code that randomly flips the board during self-play, as the AlphaGo Zero paper describes. [...]

Sounds interesting, but I would suggest doing that in a different PR because otherwise it gets too messy.

And please resolve the conflicts, thanks. ;-)

NMO13 avatar May 16 '20 12:05 NMO13

Thank you once again for your comments. I fixed the bug (quite a large one) caused by calling self.dirichlet_noise instead of dirichlet_noise in the code I committed to my pull request. I also fixed this in the code for the model I am training now. I saw the results of your dirichlet_noise experiment and am glad the evidence matches our intuitions. How similar is your code for implementing dirichlet noise?

Also let me know what you think about the i==0 and leaf node situations.

jrbuhl93 avatar May 16 '20 15:05 jrbuhl93

How similar is your code for implementing dirichlet noise?

Pretty similar. The main difference is that I don't override Ps, as already mentioned. I do:

        p = np.copy(self.Ps[s])
        if self.add_noise and depth == 0:
            self.add_dirichlet(p, valids)

Notice the np.copy. And then I do:

        if (s, a) in self.Qsa:
            u = self.Qsa[(s, a)] + self.args.cpuct * p[a] * math.sqrt(self.Ns[s]) / (1 + self.Nsa[(s, a)])
        else:
            u = self.args.cpuct * p[a] * math.sqrt(self.Ns[s] + EPS)     # Q = 0 ?

That should ensure that the probabilities don't drift too far away.

NMO13 avatar May 16 '20 18:05 NMO13

I think I like my implementation of Dirichlet noise more, because based on my understanding of the paper it is more likely DeepMind implemented it the way I did. It seems better to me to search a path repeatedly and in depth when that path got a big increase in P from the Dirichlet noise. If you draw fresh Dirichlet noise for each simulation, I think the effect will be more like adding uniform noise, because the variation averages out over the repeated sampling.
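To make the difference concrete, here is a small runnable illustration (the numbers are made up; it only shows that averaging many Dirichlet draws flattens toward uniform, whereas a single draw stays uneven):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, n_valid, num_sims = 0.6, 7, 150   # illustrative values only

    # Noise drawn once per move: the same sample biases every simulation,
    # so whichever action got a big boost is searched repeatedly and in depth.
    eta_once = rng.dirichlet([alpha] * n_valid)

    # Noise redrawn on every simulation: averaging the per-simulation draws
    # shows the perturbation flattening toward uniform.
    eta_mean = np.mean([rng.dirichlet([alpha] * n_valid) for _ in range(num_sims)], axis=0)

    print(eta_once)   # uneven: some actions get a much larger share than others
    print(eta_mean)   # close to 1/n_valid everywhere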

Do you agree or disagree?

jrbuhl93 avatar May 18 '20 20:05 jrbuhl93

Ok, so my intuition was to add, as you already mentioned, something more like uniform noise, just to make sure that the other paths are not "starving". I am just thinking that your solution of repeatedly overriding Ps could mess up the probabilities too much. But of course, I might be wrong and your solution works even better.

NMO13 avatar May 19 '20 10:05 NMO13

I am currently training your version. I am curious how it will perform against my version.

NMO13 avatar May 20 '20 08:05 NMO13

I am currently training your version. I am curious how it will perform against my version.

To minimize variables, you should use the Dirichlet alpha you used to train your other model. I got very far with 1.4, but it got stuck for ~15 iterations with every game ending in a tie. I just dropped the alpha to 0.9, so hopefully that gives enough variation for it to find the optimal solution. What alpha did you use when you trained?

jrbuhl93 avatar May 20 '20 15:05 jrbuhl93

I used 0.6, and for the first 100 iterations it found a better model every ~4 iterations. So here are some results. I let them pit against each other for 50 games.

Comparing your version vs mine: (13, 24, 13) -> your version is stronger than mine.

Comparing your version vs the original one: (10, 32, 8) -> your version is much stronger.

Conclusion: Of course, I only tested your algo on one game, but I am highly confident that it works equally well for other games. You did well. I would merge your algorithm into master. @suragnair what do you think?

NMO13 avatar May 24 '20 14:05 NMO13