Provided policy_improvement() solution initializes values to zero for each iteration
The provided solution does not follow the pseudocode on p. 102 exactly: it re-initializes policy evaluation with zeros on every iteration, even though the book says: "Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy." Warm-starting makes no visible difference in the "gridworld" example, but it may speed up convergence in more complex examples.
It makes sense to change the policy_eval signature to accept an initial value for V, something like this:
def policy_eval(policy, env, discount_factor=1.0, theta=0.00001, V_init=None):
    ...
        V_init: initial value function vector; if None, start from all zeros.
    ...
    V = np.zeros(env.nS) if V_init is None else V_init
and to change policy_improvement to pass the previous value function to policy_eval.
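For example, the evaluation step inside policy_improvement could then become something like this (a minimal sketch of the call site, assuming the V_init parameter above):

# reuse the value function of the previous policy as the starting point
V = policy_eval(policy, env, V_init=V)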
See also related issues about another bug in the function (#203) and its naming (#202).
Yes, you are right. I will post my code in the next comment.
import numpy as np

def policy_eval(policy, env, V=None, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the
    environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        V: optional initial value function vector; defaults to all zeros.
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start from the given value function, or from all zeros if none is provided
    if V is None:
        V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)
def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes a policy, the env,
            an optional initial value function, and a discount factor.
        discount_factor: gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """
    # Start with a random (uniform) policy
    policy = np.ones([env.nS, env.nA]) / env.nA
    # Find the value function for the random policy
    V = policy_eval_fn(policy, env)
    while True:
        # Update the policy from the current value function
        policy_stable = True
        # For every state
        for s in range(env.nS):
            # Remember the present policy for this state
            old_action = policy[s].copy()
            # One-step lookahead: expected value of each action
            best_action = np.zeros(env.nA)
            for a in range(env.nA):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Accumulate the expected value of taking action a
                    best_action[a] += prob * (reward + discount_factor * V[next_state])
            # Pick the best action(s)
            best_action = (best_action == best_action.max()).astype(int)
            # Turn the best actions into a probability distribution
            policy[s] = [float(i) / sum(best_action) for i in best_action]
            # Zero out the probabilities in the terminal states of the gridworld
            if s == 0 or s == env.nS - 1:
                policy[s] = 0
            # Check whether the policy changed in this state
            if (old_action != policy[s]).any():
                policy_stable = False
        # If the policy did not change, return the best policy and V
        if policy_stable:
            return policy, V
        # Policy evaluation, started from the value function of the previous policy
        V = policy_eval_fn(policy, env, V)
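For a quick check, something like the following should work (a minimal usage sketch; it assumes the GridworldEnv from this repo's lib/envs/gridworld module, as used in the DP notebooks):

from lib.envs.gridworld import GridworldEnv

env = GridworldEnv()
policy, v = policy_improvement(env)
print("Policy probability distribution:")
print(policy)
print("Value function reshaped to the grid:")
print(v.reshape(env.shape))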