tic_tac_toe
Teaching the computer to play Tic Tac Toe using Deep Q Networks
Tic Tac Toe played by Double Deep Q-Networks
This repository contains a (successful) attempt to train a Double Deep Q-Network (DDQN) agent to play Tic-Tac-Toe. It learned to:
- Distinguish valid from invalid moves
- Comprehend how to win a game
- Block the opponent when it poses a threat
Key formulas of algorithms used:
Double Deep Q-Networks:
Based on the DDQN algorithm by van Hasselt et al. [1]. The cost function used is:
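In standard notation, with r the reward, γ the discount factor and (s, a, s', a') a stored transition, one common form of this loss is:

```math
L(\theta) = \mathbb{E}\left[ \left( r + \gamma \, Q_{\vartheta}\!\left(s', \arg\max_{a'} Q_{\theta}(s', a')\right) - Q_{\theta}(s, a) \right)^{2} \right]
```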
Where θ represents the trained Q-Network and ϑ represents the semi-static Q-Target network.
The Q-Target update rule is based on the DDPG algorithm by Lillicrap et al. [2]:
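In the same notation, the soft update of the Q-Target network reads:

```math
\vartheta \leftarrow \tau \, \theta + (1 - \tau) \, \vartheta
```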
for some 0 <= τ <= 1.
Maximum Entropy Learning:
Based on a paper by Haarnoja et al. [3] and designed according to a blog-post by BAIR [4].
Q-Values are computed using the Soft Bellman Equation:
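One common form of the soft Bellman backup, written here with a temperature parameter α (a symbol added only to state the equation), is:

```math
Q_{\mathrm{soft}}(s, a) = r + \gamma \, \mathbb{E}_{s'}\!\left[ \alpha \log \sum_{a'} \exp\!\left( \frac{Q_{\mathrm{soft}}(s', a')}{\alpha} \right) \right]
```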
Trained models:
Two types of agents were trained:
- a regular DDQN agent, referred to as 'Q'
- an agent which learns using maximum entropy, referred to as 'E'
Both models use a cyclic memory buffer as their experience-replay memory.
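A cyclic buffer simply overwrites its oldest entries once it reaches capacity. A minimal generic sketch of such an experience-replay memory (not the repository's actual implementation):

```python
import random


class CyclicBuffer:
    """Minimal cyclic experience-replay buffer: overwrites the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0  # index of the next slot to write

    def append(self, transition):
        # transition is e.g. a (state, action, reward, next_state, done) tuple
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition  # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # uniform random mini-batch for training
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```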
All pre-trained models are found under the `models/` directory, with several trained models for each variant: Q files are DDQN models and E files are DDQN-Max-Entropy models.
Do it yourself:
The `main.py` script holds several useful functions. See the doc-strings for more details:
- `train` will initiate a single training process. It will save the weights and plot graphs. Using the current settings, training took me around 70 minutes on a 2018 MacBook Pro.
- `multi_train` will train several DDQN and DDQN-Max-Entropy models.
- `play` allows a human player to play against a saved model.
- `face_off` can be used to compare models by letting them play against each other.
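As a rough illustration, these could be called from a Python session as below; the real signatures and arguments are in the doc-strings of `main.py`, so the argument-free calls here are an assumption:

```python
# Hypothetical usage sketch -- check the doc-strings in main.py for the real
# signatures; calling these functions without arguments is an assumption here.
import main

main.train()     # train a single model, saving weights and plotting graphs
main.play()      # play against a saved model as a human
main.face_off()  # compare two saved models by letting them play each other
```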
The `DeepQNetworkModel` class can be easily configured using these parameters (among others):
- `layers_size`: set the number and size of the hidden layers of the model (only fully-connected layers are supported)
- `memory`: set the memory type (cyclic buffer or reservoir sampling)
- `double_dqn`: set whether to use DDQN or a standard DQN
- `maximize_entropy`: set whether to use maximum entropy learning or not
See the class doc-string for all possible parameters.
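Only the four parameter names above are documented here, so the following is a hypothetical configuration sketch; the import path, the value passed to `memory`, and any other arguments are assumptions, not the actual API:

```python
# Hypothetical sketch -- layers_size, memory, double_dqn and maximize_entropy
# are documented parameter names; everything else below is an assumption.
from deep_q_network import DeepQNetworkModel  # assumed module path

model = DeepQNetworkModel(
    layers_size=[32, 32],    # two fully-connected hidden layers of 32 units each
    memory="cyclic",         # assumed value: cyclic buffer rather than reservoir sampling
    double_dqn=True,         # Double DQN rather than a standard DQN
    maximize_entropy=False,  # set True for the Max-Entropy ('E') variant
)
```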
Related blogposts:
- Read about where I got stuck when developing this code on "Lessons Learned from Tic-Tac-Toe: Practical Reinforcement Learning Tips"
- Read about the E Max-Entropy models on "Open Minded AI: Improving Performance by Keeping All Options on the Table"
References:
- [1] Hado van Hasselt et al., Deep Reinforcement Learning with Double Q-learning
- [2] Lillicrap et al., Continuous control with deep reinforcement learning
- [3] Haarnoja et al., Reinforcement Learning with Deep Energy-Based Policies
- [4] Tang & Haarnoja, Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning (blog-post)