Is a line missing in 'MC Control with Epsilon-Greedy Policies Solution.ipynb'?
In the function `mc_control_epsilon_greedy`:
```python
# Find all (state, action) pairs we've visited in this episode
# We convert each state to a tuple so that we can use it as a dict key
sa_in_episode = set([(tuple(x[0]), x[1]) for x in episode])
for state, action in sa_in_episode:
    sa_pair = (state, action)
    # Find the first occurrence of the (state, action) pair in the episode
    first_occurence_idx = next(i for i, x in enumerate(episode)
                               if x[0] == state and x[1] == action)
    # Sum up all rewards since the first occurrence
    G = sum([x[2] * (discount_factor ** i)
             for i, x in enumerate(episode[first_occurence_idx:])])
    # Calculate average return for this state over all sampled episodes
    returns_sum[sa_pair] += G
    returns_count[sa_pair] += 1.0
    Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]

# The policy is improved implicitly by changing the Q dictionary
return Q, policy
```
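For context, earlier in the same function the policy is created once, before the episode loop, and is then called to sample an action at every step of episode generation. A rough sketch of that preceding part, assuming the usual structure of the solution notebook and the classic `gym` API:

```python
# Sketch only -- variable names follow the notebook, details may differ.
policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

for i_episode in range(1, num_episodes + 1):
    # Generate an episode: a list of (state, action, reward) tuples
    episode = []
    state = env.reset()
    while True:
        probs = policy(state)  # reads the *current* Q values
        action = np.random.choice(np.arange(len(probs)), p=probs)
        next_state, reward, done, _ = env.step(action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state

    # ... the first-visit MC update quoted above runs here ...
```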
I think a line should be added before the final `return`, so that the last lines read:
```python
    Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]

# The policy is improved implicitly by changing the Q dictionary
policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
return Q, policy
```
Otherwise the policy will never be updated.
@Ritz111 No, it will update. The policy returned by `make_epsilon_greedy_policy` is a closure over the `Q` dictionary: every time it is called, it computes the action probabilities from the current Q values, so updating `Q` in place updates the policy implicitly, exactly as the comment says.
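To make the closure behaviour concrete, here is a minimal, self-contained sketch. The body of `policy_fn` follows the usual implementation of `make_epsilon_greedy_policy` in the solution notebook, but treat the details as an approximation; the state name and values in the demo at the end are made up for illustration.

```python
import numpy as np
from collections import defaultdict

def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Return a policy function that is epsilon-greedy w.r.t. Q.

    The returned closure reads Q at call time, so mutating Q in
    place also changes the policy -- no re-creation is needed.
    """
    def policy_fn(observation):
        # Start with uniform exploration probability epsilon / nA ...
        A = np.ones(nA, dtype=float) * epsilon / nA
        # ... and put the remaining (1 - epsilon) mass on the greedy action.
        best_action = np.argmax(Q[observation])
        A[best_action] += 1.0 - epsilon
        return A

    return policy_fn

# Hypothetical demo: the policy shifts as soon as Q is mutated in place.
Q = defaultdict(lambda: np.zeros(2))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=2)

print(policy("some_state"))   # [0.95, 0.05]: greedy on action 0
Q["some_state"][1] = 1.0      # update Q in place, as mc_control_epsilon_greedy does
print(policy("some_state"))   # [0.05, 0.95]: now greedy on action 1
```

Because `policy_fn` looks up `Q[observation]` at call time rather than copying the values, re-creating the policy inside the loop would be redundant; the proposed extra line is harmless but unnecessary.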