
Vectorizing Policy Evaluation

hsz1992 opened this issue 7 years ago • 3 comments

import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    v = np.zeros(shape=(env.nS, 1))      # value vector, indexed by state
    R = np.zeros(shape=(env.nS, 1))      # expected reward vector, indexed by state
    P = np.zeros(shape=(env.nS, env.nS)) # state transition matrix (from, to)

    # Construct R and P under the given policy
    for s in range(env.nS):
        for a, action_prob in enumerate(policy[s]):
            for prob, next_state, reward, done in env.P[s][a]:
                R[s] += action_prob * prob * reward
                P[s, next_state] += action_prob * prob

    # Iterate v <- R + gamma * P v until convergence
    while True:
        v_prev = v
        v = R + discount_factor * np.dot(P, v)
        if np.max(np.abs(v - v_prev)) < theta:
            break
    return np.squeeze(v)

random_policy = np.ones([env.nS, env.nA]) / env.nA
%timeit v = policy_eval(random_policy, env)
# Output: 2.17 ms ± 62 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
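The iteration above converges to the fixed point of v = R + γPv, so it can be sanity-checked against a direct linear solve of (I − γP)v = R. Here is a minimal, self-contained sketch of that check on a hypothetical 3-state, 2-action MDP (`ToyEnv` below is an assumption, built only to mimic the Gym-style `env.nS` / `env.P[s][a]` interface the snippet relies on):

```python
import numpy as np

class ToyEnv:
    # Hypothetical 3-state, 2-action MDP in the Gym dict format:
    # env.P[s][a] -> list of (prob, next_state, reward, done) tuples.
    nS, nA = 3, 2
    P = {
        0: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, False)]},
        1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 2.0, False)]},
        2: {0: [(1.0, 2, 0.0, True)],  1: [(1.0, 2, 0.0, True)]},
    }

def policy_eval_vec(policy, env, discount_factor=0.9, theta=1e-8):
    v = np.zeros((env.nS, 1))
    R = np.zeros((env.nS, 1))
    P = np.zeros((env.nS, env.nS))
    # Fold the policy and dynamics into an expected-reward vector R
    # and a state-to-state transition matrix P, exactly as above.
    for s in range(env.nS):
        for a, action_prob in enumerate(policy[s]):
            for prob, next_state, reward, done in env.P[s][a]:
                R[s] += action_prob * prob * reward
                P[s, next_state] += action_prob * prob
    # Synchronous update v <- R + gamma * P v until convergence.
    while True:
        v_new = R + discount_factor * P @ v
        if np.max(np.abs(v_new - v)) < theta:
            return np.squeeze(v_new), R, P
        v = v_new

env = ToyEnv()
gamma = 0.9
policy = np.ones((env.nS, env.nA)) / env.nA
v, R, P = policy_eval_vec(policy, env, discount_factor=gamma)

# Fixed point satisfies (I - gamma * P) v = R; solve it directly.
v_direct = np.linalg.solve(np.eye(env.nS) - gamma * P, R).squeeze()
assert np.allclose(v, v_direct, atol=1e-6)
```

The direct solve is also a reminder that for small state spaces, iteration is optional: the Bellman expectation equation is just a linear system.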

hsz1992 avatar Mar 09 '18 06:03 hsz1992

It looks like this belongs in a pull request

jonahweissman avatar Mar 09 '18 14:03 jonahweissman

Thanks for the code! I did not vectorize the implementation on purpose because this repository is meant as a learning tool, and I think the code is a bit more intuitive if it's not vectorized.
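For readers comparing the two styles, the non-vectorized version being referred to works state by state with an in-place sweep. The sketch below is an illustrative reconstruction, not the repository's exact code, and reuses a hypothetical Gym-style toy env (`ToyEnv` is an assumption):

```python
import numpy as np

class ToyEnv:
    # Hypothetical 3-state, 2-action MDP in the Gym dict format:
    # env.P[s][a] -> list of (prob, next_state, reward, done) tuples.
    nS, nA = 3, 2
    P = {
        0: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, False)]},
        1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 2.0, False)]},
        2: {0: [(1.0, 2, 0.0, True)],  1: [(1.0, 2, 0.0, True)]},
    }

def policy_eval_loop(policy, env, discount_factor=0.9, theta=1e-8):
    # One Bellman backup per state per sweep:
    # V[s] <- sum_a pi(a|s) sum_s' p(s',r|s,a) * (r + gamma * V[s'])
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place update: later states see this sweep's values
        if delta < theta:
            return V

env = ToyEnv()
policy = np.ones((env.nS, env.nA)) / env.nA
V = policy_eval_loop(policy, env)
```

The loop form makes the per-state backup explicit, while the vectorized form hides it inside a matrix product; both converge to the same fixed point for γ < 1, though the in-place sweep updates states asynchronously.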

dennybritz avatar Mar 10 '18 01:03 dennybritz

@zuanzuan1992 just what I was looking for!

iSarCasm avatar Mar 24 '18 19:03 iSarCasm