AlphaZero for Backgammon
Hi, I am trying to create a good AI for backgammon, as there are no examples for it. I tried pure PPO, but it did not give very good results. I was thinking of applying the AlphaZero algorithm, but realized it is only for deterministic games. Is there an inherent reason for this? I see that it currently does not handle chance nodes correctly, but that could be changed rather easily, right?
I would appreciate any feedback in this area!
I am actually surprised to discover this because our MCTS implementations all handle chance nodes.
@mlschmidt366 also implied that he had to make non-trivial changes to get games with chance working in AlphaZero recently, and I was surprised when he said it too.
Can you explain more about why our AlphaZero does not support games with chance nodes? I am pretty sure that our original implementation didn't test on those games, but I'm still surprised that it's not supported out-of-the-box (since our MCTS implementations support these types of games, and it's the only place in the entire algorithm that would need to know about how to handle chance nodes...)
@tewalds, any opinion on this?
I think it should definitely be possible to run AlphaZero on games with chance out-of-the-box, and I would be happy to fix this... assuming it's as easy as I think it is ;-)
Also: which version of AZ are you using? C++ based on LibTorch or the pure python one?
I am using the pure python based version (https://github.com/deepmind/open_spiel/tree/master/open_spiel/python/algorithms/alpha_zero)
I see some reasons why it does not work out of the box. Firstly, it flat out rejects non-deterministic games: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/alpha_zero/alpha_zero.py#L503
That is easy to comment out, though. The next issue is that it does not have any logic to deal with chance nodes: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/alpha_zero/alpha_zero.py#L209
I think the correct logic is implemented for MCTS: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L326
Finally, while RandomRolloutEvaluator can deal with chance nodes in its evaluate function, AlphaZeroEvaluator cannot. Therefore, mcts_search calling AlphaZeroEvaluator.evaluate with a chance node is also not okay: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L405
This also means that _apply_tree_policy potentially returning a chance node as the working state is a problem: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L345
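To illustrate the gap: the random rollout evaluator essentially handles chance nodes by sampling through them during its playouts, along these lines (a sketch from memory, not the actual OpenSpiel code):

```python
import numpy as np

def rollout_value(state, n_rollouts=1, rng=np.random):
  """Estimates the returns of `state` by uniform-random playouts."""
  total = None
  for _ in range(n_rollouts):
    working_state = state.clone()
    while not working_state.is_terminal():
      if working_state.is_chance_node():
        # Chance node: sample an outcome according to its probability.
        outcomes, probs = zip(*working_state.chance_outcomes())
        action = rng.choice(outcomes, p=probs)
      else:
        # Decision node: play uniformly at random.
        action = rng.choice(working_state.legal_actions())
      working_state.apply_action(action)
    returns = np.array(working_state.returns())
    total = returns if total is None else total + returns
  return total / n_rollouts
```

AlphaZeroEvaluator has no equivalent branch for chance nodes, which is why it breaks when mcts_search hands it one.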
Ok great, this should be easy to fix. Would you be willing to submit a PR if I walk you through what needs to be done?
I would be happy to do it, but with my current load it would be weeks before I get to it.
Yeah, sure, I could give it a try, though I would only be fixing the Python version in this case, right?
Yep, whichever one you want to use. We can fix the other ones similarly later when we find the time. Thanks! I will write more in about 90 minutes.
Well, the first thing is that the AlphaZero evaluator should be modified: if called on a chance node, sample chance events until a normal node (or terminal) is reached, and call the value function there instead.
Scratch that. There is a better way. We should add an option to MCTS to never stop expanding at chance nodes (i.e. keep expanding the tree, if the newly added node is a chance node, until you get a decision node or terminal). This way the evaluator will never be called at chance nodes. Then we should enable this option when running MCTS via AlphaZero.
The logic in the _play_game method is simple: if you encounter a chance node, sample an outcome, apply it, and continue (don't invoke the bots).
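Something like the following is what I have in mind for _play_game (an illustrative sketch only; the helper and function names here are made up, not the actual alpha_zero.py code):

```python
import numpy as np

def sample_chance_outcomes(state, rng=np.random):
  """Samples and applies chance outcomes until a decision or terminal node."""
  while state.is_chance_node():
    outcomes, probs = zip(*state.chance_outcomes())
    state.apply_action(rng.choice(outcomes, p=probs))

def play_one_game(game, bots, rng=np.random):
  """Plays one game, skipping over chance nodes without invoking the bots."""
  state = game.new_initial_state()
  while not state.is_terminal():
    sample_chance_outcomes(state, rng)  # chance: sample + apply, no bots
    if state.is_terminal():
      break
    action = bots[state.current_player()].step(state)  # decision node
    state.apply_action(action)
  return state.returns()
```

Note that the tree-side option I described above is different: there you keep expanding the chance node's children rather than sampling through them, so the evaluator only ever sees decision or terminal nodes.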
I have been thinking about whether there is also a significant limitation here:
https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L311
Legal actions, and therefore child nodes, are computed only once per node, but in a non-deterministic game these are not the same every time that node is visited; they depend on the previous chance node. E.g. for backgammon, just because an action was valid for a roll of 1,1 does not mean it will be valid for a roll of 2,2. I'm not sure what the best way is to deal with this conceptually?
That is not a problem because the current roll is part of the state (i.e. the node). So LegalActions will return a different set of actions when the roll is 1,1 than when it's 2,2, and those two different states would be represented as different nodes in the MCTS tree. If that were not the case, then MCTS would not work on games with chance nodes at all, but you can verify that it does by running mcts_example.py on backgammon.
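For example, a quick way to see that the sampled roll is baked into the state (an illustrative snippet; it assumes the opening roll in our backgammon implementation is a chance node, which I believe it is):

```python
import pyspiel

game = pyspiel.load_game("backgammon")
state = game.new_initial_state()
assert state.is_chance_node()  # the opening roll is decided by chance
for outcome, prob in state.chance_outcomes():
  child = state.clone()
  child.apply_action(outcome)  # the child state now includes this roll
  if not child.is_chance_node():
    # Different rolls generally give different sets of legal moves.
    print(outcome, prob, len(child.legal_actions()))
```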
Here https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L313
the evaluator is called during expansion to get the legal actions and their priors. The prior function of the random evaluator correctly deals with this: https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/python/algorithms/mcts.py#L78
The AlphaZero prior does not. Should I similarly just return the state's chance_outcomes for chance nodes in AlphaZero?
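Concretely, I was thinking of something along these lines for the AlphaZero evaluator's prior (just a sketch; `self._inference` stands in for however the evaluator actually queries the model):

```python
def prior(self, state):
  """Returns (action, probability) pairs used as priors during expansion."""
  if state.is_chance_node():
    # Mirror the random evaluator: at a chance node the prior is simply
    # the chance outcome distribution.
    return state.chance_outcomes()
  # Otherwise query the network's policy head (shown schematically).
  _, policy = self._inference(state)
  return [(action, policy[action]) for action in state.legal_actions()]
```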
Side question: should the code simply run on the GPU if TensorFlow with GPU support is installed? Or is this not currently supported?
I gave it a shot here https://github.com/deepmind/open_spiel/pull/776. Please let me know
Nice!! Thanks. Is it running for you? Can you try it on a very simple game with chance (e.g. pig)?
It is definitely running for me. I just tried it on pig, and I got the following output. I'm not sure how to interpret the total loss only falling at the beginning and the policy loss staying the same, though.
[2022-01-27 10:56:23.728] Model type: mlp(128, 2)
[2022-01-27 10:56:23.729] Model size: 89351 variables
[2022-01-27 10:56:23.938] Initial checkpoint: alphazero_output/checkpoint-0
[2022-01-27 10:57:57.177] Step: 1
[2022-01-27 10:57:57.177] Collected 21852 states from 184 games, 142.6 states/s. 35.7 states/(s*actor), game length: 118.8
[2022-01-27 10:57:57.177] Buffer size: 21852. States seen: 21852
[2022-01-27 10:57:58.360] Losses(total: 1.665, policy: 0.686, value: 0.952, l2: 0.027)
[2022-01-27 10:59:15.205] Step: 2
[2022-01-27 10:59:15.205] Collected 21895 states from 179 games, 280.6 states/s. 70.2 states/(s*actor), game length: 122.3
[2022-01-27 10:59:15.205] Buffer size: 43747. States seen: 43747
[2022-01-27 10:59:16.942] Losses(total: 1.444, policy: 0.686, value: 0.734, l2: 0.025)
[2022-01-27 11:00:47.319] Step: 3
[2022-01-27 11:00:47.319] Collected 21845 states from 173 games, 237.2 states/s. 59.3 states/(s*actor), game length: 126.3
[2022-01-27 11:00:47.319] Buffer size: 65536. States seen: 65592
[2022-01-27 11:00:50.182] Losses(total: 1.411, policy: 0.686, value: 0.702, l2: 0.022)
[2022-01-27 11:02:13.711] Step: 4
[2022-01-27 11:02:13.711] Collected 21945 states from 185 games, 254.0 states/s. 63.5 states/(s*actor), game length: 118.6
[2022-01-27 11:02:13.711] Buffer size: 65536. States seen: 87537
[2022-01-27 11:02:16.818] Losses(total: 1.433, policy: 0.685, value: 0.727, l2: 0.021)
[2022-01-27 11:03:41.190] Step: 5
[2022-01-27 11:03:41.190] Collected 21900 states from 180 games, 250.3 states/s. 62.6 states/(s*actor), game length: 121.7
[2022-01-27 11:03:41.190] Buffer size: 65536. States seen: 109437
[2022-01-27 11:03:43.995] Losses(total: 1.475, policy: 0.686, value: 0.768, l2: 0.021)
[2022-01-27 11:05:03.701] Step: 6
[2022-01-27 11:05:03.701] Collected 21894 states from 173 games, 265.3 states/s. 66.3 states/(s*actor), game length: 126.6
[2022-01-27 11:05:03.701] Buffer size: 65536. States seen: 131331
[2022-01-27 11:05:06.690] Losses(total: 1.482, policy: 0.686, value: 0.776, l2: 0.021)
[2022-01-27 11:06:33.634] Step: 7
[2022-01-27 11:06:33.634] Collected 21878 states from 169 games, 243.3 states/s. 60.8 states/(s*actor), game length: 129.5
[2022-01-27 11:06:33.634] Buffer size: 65536. States seen: 153209
[2022-01-27 11:06:37.657] Losses(total: 1.465, policy: 0.686, value: 0.758, l2: 0.021)
[2022-01-27 11:08:02.932] Step: 8
[2022-01-27 11:08:02.932] Collected 21874 states from 169 games, 245.0 states/s. 61.2 states/(s*actor), game length: 129.4
[2022-01-27 11:08:02.933] Buffer size: 65536. States seen: 175083
[2022-01-27 11:08:05.736] Losses(total: 1.450, policy: 0.686, value: 0.743, l2: 0.021)
Cool! As long as the losses are roughly going down, that's an indication it's working. You can play against your checkpoints if you want, using this: https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/examples/mcts.py (it might take a while before they get good).
If you run on backgammon, I'd be curious about how well it does!
For better analysis of the runs, there are tools to analyze what is going on, visualize losses, etc. Take a look at the full docs on how to do that: https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
I played against it after a while, but it was hard to judge how good it was, as I am not familiar with this game. I let it play against a random agent, and it won 97% of the games. Against the generic MCTS agent, though, it only won 36% of the games.

Ok, well that's an indication that it's working. 25 steps is not very long... beating vanilla MCTS 36% of the time is already good with that amount of training. You'll need a lot more before it gets any good by human standards. This is also a very stochastic game, so you may have to wait longer than usual (or need more simulations per search).
The policy loss does not seem to be going down at all, judging from the graph and your text output above. That's odd. @tewalds, any idea what might be going on?
To clarify, I am using the alpha_zero.py example script (https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/examples/alpha_zero.py), but I changed connect_four to pig, max_simulations from 300 to 30, and the network type to an MLP with 2 layers and a hidden size of 128.
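For reference, the equivalent command-line overrides would look roughly like this (flag names from memory; please double-check them against the script's --help):

```
python open_spiel/python/examples/alpha_zero.py \
  --game=pig \
  --max_simulations=30 \
  --nn_model=mlp \
  --nn_depth=2 \
  --nn_width=128 \
  --path=alphazero_output
```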
Ok I'm stuck on the policy loss not changing at all -- either there's a bug in the logging or the policy is legitimately not getting trained.
Can I ask you to run it quickly on Tic-Tac-Toe and show us the graphs? We know it should be working there, and that would tell us that maybe it's an interaction in the way the games with chance are being handled.
Scratch that, can you try Connect Four (with the default config)? We have old graphs for that, so we know what they should look like. Thanks!
Edit: The policy loss should be changing every step (not by much, but you should see it in the textual output), so you don't have to wait long.
I had already started running it on backgammon, and the policy loss is falling there.
[2022-01-27 14:49:14.120] Step: 1
[2022-01-27 14:49:14.120] Collected 21873 states from 180 games, 49.2 states/s. 24.6 states/(s*actor), game length: 121.5
[2022-01-27 14:49:14.120] Buffer size: 21873. States seen: 21873
[2022-01-27 14:49:18.293] Losses(total: 3.257, policy: 2.470, value: 0.750, l2: 0.037)
[2022-01-27 14:55:15.906] Step: 2
[2022-01-27 14:55:15.906] Collected 21870 states from 240 games, 60.5 states/s. 30.2 states/(s*actor), game length: 91.1
[2022-01-27 14:55:15.906] Buffer size: 43743. States seen: 43743
[2022-01-27 14:55:24.216] Losses(total: 3.014, policy: 2.351, value: 0.630, l2: 0.033)
[2022-01-27 15:01:31.573] Step: 3
[2022-01-27 15:01:31.573] Collected 21898 states from 188 games, 58.3 states/s. 29.1 states/(s*actor), game length: 116.5
[2022-01-27 15:01:31.573] Buffer size: 65536. States seen: 65641
[2022-01-27 15:01:43.690] Losses(total: 2.902, policy: 2.313, value: 0.557, l2: 0.032)
[2022-01-27 15:07:53.268] Step: 4
[2022-01-27 15:07:53.268] Collected 21891 states from 190 games, 57.4 states/s. 28.7 states/(s*actor), game length: 115.2
[2022-01-27 15:07:53.268] Buffer size: 65536. States seen: 87532
[2022-01-27 15:08:04.811] Losses(total: 2.813, policy: 2.188, value: 0.593, l2: 0.032)
I am running it on connect_four right now
Nice! Ok, I think it's actually working. What's probably happening in pig is that 30 simulations per search is too small, so you're not getting a policy that differs much from uniform random (because the domain is so noisy). It could also be that the exploration constant doesn't make sense for a game with only 2 actions. If Connect Four looks good, then I'd say everything is probably fine.
By the way, in backgammon you might need more simulations (you might need 500+)... it's also quite noisy, and it has way more actions per state. Also, IIRC we used the Tesauro encoding of the state, which I believe is not convolutional. There might be other encodings of the observations that perform better, but it would be pretty cool to already reproduce the Tesauro result.
It would be nice if we could hook up gnubg as an external bot to test against! Do you know if there is a standard communication protocol for backgammon, like GTP for Go and UCI for chess?
Connect Four might take some time; I will report when I have the results.
I will increase the simulations for backgammon, and for now it is not a convolutional representation, as you said. Do you think good performance is achievable in a realistic time using the Python alpha_zero? I also considered switching to the LibTorch or TensorFlow_CC based alpha_zero versions, but I am not sure. Especially since the LibTorch version only supports a resnet, so the representation would need to be convolutional, or I would need to add an MLP to LibTorch.
After getting good results, hooking up something like gnubg would certainly be very interesting. I am not an expert in backgammon, but so far I have not seen anything like UCI for chess.
Oh, you can stop it after the 2nd or 3rd step if you see the policy loss changing. I have no reason to think there's a bug at this point. I think the settings for pig would need tuning.
On timing... it's hard to say. Are you using a GPU? The Python version was meant more as an example for small games. On the other hand, backgammon is a game that got to superhuman level in 1992 with much simpler algorithms. My guess is it would probably get to a decent level fairly fast... maybe a day or so of training once you find good parameters. Getting to a competitive level would probably require at least using the C++ version, and might require a higher-performance distributed implementation. It really depends how far you want to go.
I would be surprised (again..! :-p) to find that we don't support MLPs in the libtorch AlphaZero? Do you have any reason to think that?
By the way, the TensorFlow_CC version does not work externally... we tried really hard but could not get it to work in the end. So you'd have to use the LibTorch one. I know of at least one student using it, and it seems to work fairly well for their research purposes.
Haha, well, yes. In the original PR https://github.com/deepmind/open_spiel/pull/319 it is mentioned that "Unlike the TensorFlow Python AlphaZero, there is only the option of a residual network model so far." Also, in the official example https://github.com/deepmind/open_spiel/blob/f522d174a1e2e8fbaf2007294985869d6e520669/open_spiel/examples/alpha_zero_torch_example.cc#L29 it is stated that only a resnet is supported. Finally, I don't see any MLP code in https://github.com/deepmind/open_spiel/blob/master/open_spiel/algorithms/alpha_zero_torch/model.h
As this is a hobby project, I only have access to 4 cores and an NVIDIA 1070, so I think for me it would be enough if it could utilize the GPU. If we could add an MLP to the LibTorch implementation, I would consider using it. I would also try to make the same non-deterministic-game changes I made for the Python version there as well. I might also find some time to implement the MLP myself, but I can't guarantee this right now.
@christianjans, it would be nice if the Libtorch AlphaZero supported simpler networks, like MLPs.
I assume adding this would be very easy to do. @egeozsoy, would you also be willing to submit a PR to add this support if the original contributor described what you'd have to do, as we did with games with chance?
I will check with the other student currently using it. Maybe they've added this themself already.
Sure, though I have less experience here (I have never used LibTorch, just PyTorch), so I might have more questions.