
Provided policy_improvement() solution initializes values to zero for each iteration


The provided solution does not follow the pseudocode on p. 102 exactly. It initializes policy evaluation with zeros on every iteration, even though the book says: "Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy." The proposed change does not make a noticeable difference in the "gridworld" example, but may speed up convergence in more complex examples.

It makes sense to change the policy_eval signature to accept an initial value for V (using None as the default, since a default argument cannot reference env), something like this:

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001, V_init=None):
...
        V_init: initial value function vector (defaults to all zeros).
...
    V = V_init if V_init is not None else np.zeros(env.nS)

and change policy_improvement to pass the previous value function to policy_eval, as in the sketch below.
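
For reference, here is a minimal sketch (not the repository's actual code) of how policy_improvement could carry the previous V over into the next evaluation. It assumes the exercise's usual policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0) signature and the env.P / env.nS / env.nA interface of the gridworld environment, with policy_eval being the modified function above:

import numpy as np

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    # Start with a uniform random policy and a zero value estimate.
    policy = np.ones([env.nS, env.nA]) / env.nA
    V = np.zeros(env.nS)

    while True:
        # Warm-start evaluation with the value function of the previous policy.
        V = policy_eval_fn(policy, env, discount_factor, V_init=V)

        policy_stable = True
        for s in range(env.nS):
            chosen_a = np.argmax(policy[s])
            # One-step lookahead over all actions using the freshly evaluated V.
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])
            best_a = np.argmax(action_values)
            if chosen_a != best_a:
                policy_stable = False
            policy[s] = np.eye(env.nA)[best_a]

        if policy_stable:
            return policy, V

With this, every evaluation after the first starts from the converged V of the previous policy, which is exactly what the quoted passage from the book describes.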

See also related issues about another bug in the function (#203) and its naming (#202).

link2xt, Jun 23 '19 22:06