deep_learning_and_the_game_of_go
Chapter 13: Possible errors in AlphaGoMCTS agent
- In dlgo.agent.alphago.AlphaGoMCTS we have the policy_rollout function at line 142:
```python
def policy_rollout(self, game_state):
    for step in range(self.rollout_limit):
        if game_state.is_over():
            break
        move_probabilities = self.rollout_policy.predict(game_state)
        encoder = self.rollout_policy.encoder
        valid_moves = [m for idx, m in enumerate(move_probabilities)
                       if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
        max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
        max_point = encoder.decode_point_index(max_index)
        greedy_move = Move(max_point)
        if greedy_move in game_state.legal_moves():
            game_state = game_state.apply_move(greedy_move)

    next_player = game_state.next_player
    winner = game_state.winner()
    if winner is not None:
        return 1 if winner == next_player else -1
    else:
        return 0
```
However, while line 148 eliminates the invalid moves, it also removes their indices, so max_index no longer refers to a position in the original move_probabilities, and max_point is no longer the point with the maximum policy probability.
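One way to fix this is to keep each probability paired with its original index, so the argmax can still be decoded into the right point. Below is a minimal sketch of a corrected function; it is illustrative only, not the repository's actual fix, and it assumes Move and the encoder are available in the module exactly as in the original code.

```python
import operator  # stdlib; Move and the encoder come from the surrounding module

def policy_rollout(self, game_state):
    for step in range(self.rollout_limit):
        if game_state.is_over():
            break
        move_probabilities = self.rollout_policy.predict(game_state)
        encoder = self.rollout_policy.encoder
        # Keep (index, probability) pairs so that filtering out illegal
        # moves does not shift the indices of the remaining entries.
        legal_pairs = [
            (idx, prob) for idx, prob in enumerate(move_probabilities)
            if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()
        ]
        if not legal_pairs:
            break
        max_index, max_value = max(legal_pairs, key=operator.itemgetter(1))
        # max_index is an index into move_probabilities, so decoding is safe.
        game_state = game_state.apply_move(Move(encoder.decode_point_index(max_index)))

    next_player = game_state.next_player
    winner = game_state.winner()
    if winner is not None:
        return 1 if winner == next_player else -1
    return 0
```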
- In dlgo.agent.alphago.AlphaGoNode we have the expand_children function at line 41:
```python
def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(probability=prob)
```
However, it does not assign the new child nodes their parent.
My correction:
```python
def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(parent=self, probability=prob)
```
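For what it's worth, the missing parent link matters because the backup phase of MCTS walks from the expanded leaf back to the root through parent pointers. Here is a stripped-down, self-contained illustration; Node is a toy stand-in for AlphaGoNode, not the book's class:

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visit_count = 0
        self.value_sum = 0.0

def backup(leaf, value):
    # Propagate a rollout result from the leaf up to the root.
    node = leaf
    while node is not None:
        node.visit_count += 1
        node.value_sum += value
        node = node.parent  # parent is None at the root, which ends the walk

root = Node()
leaf = Node(parent=Node(parent=root))
backup(leaf, 1.0)
print(root.visit_count)  # 1 -- the result reached the root

# If children were created without parent=..., leaf.parent would be None
# and only the leaf itself would ever be updated.
```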
Can you share your trained h5 files? I don't have the resources to train a good agent.
@JingOY0610 thank you.
ad 1) I'm not sure I understand your point. we need to pick the best performing move among the legal ones. what do you have in mind there?
ad 2) yes, you're right. mind sending a quick PR? thanks!
For ad 1):
So we first predict the probabilities for each move:
```python
move_probabilities = self.rollout_policy.predict(game_state)
```
move_probabilities is a list in which the element at index i is the probability of the move that index i encodes. But later on we do:
```python
valid_moves = [m for idx, m in enumerate(move_probabilities)
               if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
```
Here we keep only the entries for legal moves, which removes some indices. After that, valid_moves no longer lines up with the encoder's indexing: the element at position i is not necessarily the probability of the move that i encodes.

Here is an example. Say

move_probabilities = [0, 0, 0, 0.01, 0.99, ..., 0]

The highest-probability index is 4. Now say index 0 is not a valid move. Then

valid_moves = [0, 0, 0.01, 0.99, ..., 0]

and max_index comes out as 3, which is wrong.
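To make that concrete, here is a self-contained toy reproduction; the probabilities are made up, and legality is faked with a set of indices instead of real Go rules:

```python
import operator

move_probabilities = [0.0, 0.0, 0.0, 0.01, 0.99]  # the true best move is index 4
legal_indices = {1, 2, 3, 4}                       # pretend index 0 is illegal

# The buggy pattern: filtering drops entries and shifts later indices left.
valid_moves = [p for idx, p in enumerate(move_probabilities)
               if idx in legal_indices]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
print(max_index)  # 3 -- decoding this index picks the wrong point

# Keeping (index, probability) pairs preserves the original indices.
pairs = [(idx, p) for idx, p in enumerate(move_probabilities)
         if idx in legal_indices]
fixed_index, fixed_value = max(pairs, key=operator.itemgetter(1))
print(fixed_index)  # 4 -- the genuinely best legal move
```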
I also fixed this bug. Would you mind reviewing my PR? Thanks!
@JingOY0610 got it, you're right. thanks for spotting this - extremely helpful