
Provided policy_improvement() solution initializes values to zero for each iteration

link2xt opened this issue (2 comments)

The provided solution does not follow the pseudocode on p. 102 exactly: it re-initializes policy evaluation with zeros on every improvement step, even though the book says: "Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy." Warm-starting makes no visible difference in the "gridworld" example, but it may speed up convergence in more complex examples.

It makes sense to change the policy_eval signature to accept an initial value for V, something like this:

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001, V_init=None):
...
        V_init: initial value function vector (defaults to all zeros).
...
    # np.zeros(env.nS) cannot be the default in the signature, because env is not
    # in scope when the function is defined, so None is used as a sentinel.
    V = np.zeros(env.nS) if V_init is None else V_init

and change policy_improvement to pass the previous value function to policy_eval.
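A minimal sketch of the corresponding call site inside policy_improvement, using the V_init keyword proposed above (V here is the value function left over from the previous evaluation):

    # inside policy_improvement's evaluate/improve loop (sketch)
    V = policy_eval_fn(policy, env, discount_factor, V_init=V)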

See also related issues about another bug in the function (#203) and its naming (#202).

link2xt commented on Jun 23 '19

Yes, you are right. I will post my code in the next comment.

radumihai28 commented on Apr 22 '20

import numpy as np

def policy_eval(policy, env, V=None, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the
    environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        V: optional initial value function vector (defaults to all zeros).
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start from the supplied value function, or from all zeros if none is given.
    # (np.zeros(env.nS) cannot be the default in the signature because env is not
    # in scope at definition time.)
    V = np.zeros(env.nS) if V is None else np.array(V, dtype=float)

    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)
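A quick way to exercise the warm start (a sketch; GridworldEnv and the lib.envs.gridworld import path are assumptions based on this repo's DP exercises):

from lib.envs.gridworld import GridworldEnv  # assumed import path

env = GridworldEnv()
uniform_policy = np.ones([env.nS, env.nA]) / env.nA

V_cold = policy_eval(uniform_policy, env)          # evaluation starts from zeros
V_warm = policy_eval(uniform_policy, env, V_cold)  # evaluation starts from the previous V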

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes policy, env,
            an optional initial V, and discount_factor.
        discount_factor: gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """
    # Start with a random (uniform) policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    # Find the value function for the random policy
    V = policy_eval_fn(policy, env, discount_factor=discount_factor)

    while True:
        # Assume the policy is stable until some state's action changes
        policy_stable = True

        # For every state
        for s in range(env.nS):
            # Remember the current action distribution for this state
            old_action = policy[s].copy()

            # One-step lookahead: expected return of each action under V
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Accumulate (not overwrite) the expected value of action a
                    action_values[a] += prob * (reward + discount_factor * V[next_state])

            # Pick the best action(s), breaking ties uniformly
            best_actions = (action_values == action_values.max()).astype(int)

            # Turn the best actions into a probability distribution
            policy[s] = [float(i) / sum(best_actions) for i in best_actions]

            # Terminal states (first and last cell of the gridworld) get zero probability
            if s == 0 or s == env.nS - 1:
                policy[s] = 0

            # The policy changed if any action probability in this state changed
            if (old_action != policy[s]).any():
                policy_stable = False

        # If the policy no longer changes, return the optimal policy and its V
        if policy_stable:
            return policy, V

        # Policy evaluation for the new policy, warm-started from the previous V
        V = policy_eval_fn(policy, env, V, discount_factor=discount_factor)
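And an end-to-end run on the same (assumed) gridworld environment:

env = GridworldEnv()
policy, V = policy_improvement(env)
print(policy)  # [S, A] matrix of action probabilities per state
print(V)       # value function of the resulting policy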

radumihai28 commented on Apr 22 '20