
Is a line missing in 'MC Control with Epsilon-Greedy Policies Solution.ipynb'?

Open Ritz111 opened this issue 5 years ago • 1 comment

In the function mc_control_epsilon_greedy:

        # Find all (state, action) pairs we've visited in this episode
        # We convert each state to a tuple so that we can use it as a dict key
        sa_in_episode = set([(tuple(x[0]), x[1]) for x in episode])
        for state, action in sa_in_episode:
            sa_pair = (state, action)
            # Find the first occurrence of the (state, action) pair in the episode
            first_occurence_idx = next(i for i,x in enumerate(episode)
                                       if x[0] == state and x[1] == action)
            # Sum up all discounted rewards since the first occurrence
            G = sum([x[2]*(discount_factor**i) for i,x in enumerate(episode[first_occurence_idx:])])
            # Calculate average return for this (state, action) pair over all sampled episodes
            returns_sum[sa_pair] += G
            returns_count[sa_pair] += 1.0
            Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]
        
        # The policy is improved implicitly by changing the Q dictionary
    
    return Q, policy

I think a line should be added just before the final return statement:

            Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]
        
        # The policy is improved implicitly by changing the Q dictionary
        policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

    return Q, policy

Otherwise the policy will not be updated.

Ritz111 avatar Jan 23 '20 04:01 Ritz111

@Ritz111 No, it will update. The policy changes whenever the Q values change, because it picks the next action from the current Q values each time it is called.
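
A minimal sketch of why this works, assuming `make_epsilon_greedy_policy` returns a closure that reads `Q` each time it is called. The helper below is a simplified stand-in written for illustration, not the notebook's exact code, and the state tuple and values are made up:

```python
from collections import defaultdict


def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Illustrative stand-in: returns a policy function that closes over Q."""
    def policy_fn(observation):
        # Q is looked up at call time, so in-place changes to Q are visible here.
        q_values = Q[observation]
        best_action = max(range(nA), key=lambda a: q_values[a])
        probs = [epsilon / nA] * nA
        probs[best_action] += 1.0 - epsilon
        return probs
    return policy_fn


nA = 2
Q = defaultdict(lambda: [0.0] * nA)
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=nA)

state = (13, 10, False)   # made-up Blackjack-style state
before = policy(state)    # all Q values equal, so the tie goes to action 0
Q[state][1] = 1.0         # in-place update, like Q[state][action] = ... in the loop
after = policy(state)     # same policy object, now favors action 1
```

This only holds while `Q` is mutated in place. If `Q` were rebound to a new dictionary inside the loop, the closure would keep reading the old one, and the policy would then need to be rebuilt.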

makaveli10 avatar Apr 23 '20 06:04 makaveli10