deep_learning_and_the_game_of_go
Chapter 13: Possible errors in AlphaGoMCTS agent
- In dlgo.agent.alphago.AlphaGoMCTS we have the policy_rollout function at line 142:
```python
def policy_rollout(self, game_state):
    for step in range(self.rollout_limit):
        if game_state.is_over():
            break
        move_probabilities = self.rollout_policy.predict(game_state)
        encoder = self.rollout_policy.encoder
        valid_moves = [m for idx, m in enumerate(move_probabilities)
                       if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
        max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
        max_point = encoder.decode_point_index(max_index)
        greedy_move = Move(max_point)
        if greedy_move in game_state.legal_moves():
            game_state = game_state.apply_move(greedy_move)

    next_player = game_state.next_player
    winner = game_state.winner()
    if winner is not None:
        return 1 if winner == next_player else -1
    else:
        return 0
```
However, while line 148 eliminates the invalid moves, it also removes their indices, so max_index no longer refers to a position in the original move_probabilities, and max_point is no longer the point with the maximum policy probability.
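One way to fix this is to keep each probability paired with its original index, so the argmax can still be decoded into the right point. Below is a minimal sketch of a corrected function; it is illustrative only, not the repository's actual fix, and it assumes Move and the encoder are available in the module exactly as in the original code.

```python
import operator  # stdlib; Move and the encoder come from the surrounding module

def policy_rollout(self, game_state):
    for step in range(self.rollout_limit):
        if game_state.is_over():
            break
        move_probabilities = self.rollout_policy.predict(game_state)
        encoder = self.rollout_policy.encoder
        # Keep (index, probability) pairs so that filtering out illegal
        # moves does not shift the indices of the remaining entries.
        legal_pairs = [
            (idx, prob) for idx, prob in enumerate(move_probabilities)
            if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()
        ]
        if not legal_pairs:
            break
        max_index, max_value = max(legal_pairs, key=operator.itemgetter(1))
        # max_index is an index into move_probabilities, so decoding is safe.
        game_state = game_state.apply_move(Move(encoder.decode_point_index(max_index)))

    next_player = game_state.next_player
    winner = game_state.winner()
    if winner is not None:
        return 1 if winner == next_player else -1
    return 0
```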
- In dlgo.agent.alphago.AlphaGoNode we have the expand_children function at line 41:
```python
def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(probability=prob)
```
However, it does not assign the new child nodes their parent.
My correction:
```python
def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(parent=self, probability=prob)
```
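For what it's worth, the missing parent link matters because the backup phase of MCTS walks from the expanded leaf back to the root through parent pointers. Here is a stripped-down, self-contained illustration; Node is a toy stand-in for AlphaGoNode, not the book's class:

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visit_count = 0
        self.value_sum = 0.0

def backup(leaf, value):
    # Propagate a rollout result from the leaf up to the root.
    node = leaf
    while node is not None:
        node.visit_count += 1
        node.value_sum += value
        node = node.parent  # parent is None at the root, which ends the walk

root = Node()
leaf = Node(parent=Node(parent=root))
backup(leaf, 1.0)
print(root.visit_count)  # 1 -- the result reached the root

# If children were created without parent=..., leaf.parent would be None
# and only the leaf itself would ever be updated.
```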
Can you share your trained h5 files? I don't have the resources to train a good agent.
@JingOY0610 thank you.
ad 1) I'm not sure I understand your point. we need to pick the best performing move among the legal ones. what do you have in mind there?
ad 2) yes, you're right. mind sending a quick PR? thanks!
For ad 1):
So we first predict the probabilities for each move:
```python
move_probabilities = self.rollout_policy.predict(game_state)
```
move_probabilities is a list in which the element at index i is the probability of the move that index i encodes. But later on we do:
```python
valid_moves = [m for idx, m in enumerate(move_probabilities)
               if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
```
Here we keep only the entries for legal moves, which removes some indices. After that, valid_moves no longer lines up with the encoder's indexing: the element at position i is not necessarily the probability of the move that i encodes.

Here is an example. Say

move_probabilities = [0, 0, 0, 0.01, 0.99, ..., 0]

The highest-probability index is 4. Now say index 0 is not a valid move. Then

valid_moves = [0, 0, 0.01, 0.99, ..., 0]

and max_index comes out as 3, which is wrong.
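To make that concrete, here is a self-contained toy reproduction; the probabilities are made up, and legality is faked with a set of indices instead of real Go rules:

```python
import operator

move_probabilities = [0.0, 0.0, 0.0, 0.01, 0.99]  # the true best move is index 4
legal_indices = {1, 2, 3, 4}                       # pretend index 0 is illegal

# The buggy pattern: filtering drops entries and shifts later indices left.
valid_moves = [p for idx, p in enumerate(move_probabilities)
               if idx in legal_indices]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
print(max_index)  # 3 -- decoding this index picks the wrong point

# Keeping (index, probability) pairs preserves the original indices.
pairs = [(idx, p) for idx, p in enumerate(move_probabilities)
         if idx in legal_indices]
fixed_index, fixed_value = max(pairs, key=operator.itemgetter(1))
print(fixed_index)  # 4 -- the genuinely best legal move
```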
I also fixed this bug. Would you mind reviewing my PR? Thanks!
@JingOY0610 got it, you're right. thanks for spotting this - extremely helpful