Provided policy_improvement() solution initializes values to zero for each iteration
The provided solution does not follow the pseudocode on p. 102 exactly: it re-initializes policy evaluation with zeros on every iteration, even though the book says: "Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy." Warm-starting makes no difference in the gridworld example, but it may speed up convergence in more complex examples.
It makes sense to change the `policy_eval` signature to accept an initial value for `V`. Note that a default like `V_init=np.zeros(env.nS)` would not work, since a default argument cannot reference another parameter, so a `None` sentinel is needed:

```python
def policy_eval(policy, env, discount_factor=1.0, theta=0.00001, V_init=None):
    ...
    V_init: initial value function vector (defaults to zeros).
    ...
    V = np.zeros(env.nS) if V_init is None else np.copy(V_init)
```

and change `policy_improvement` to pass the previous value function to `policy_eval`.
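For concreteness, here is a minimal runnable sketch of the change. It assumes the repo's env interface (`env.nS`, `env.nA`, and `env.P[s][a]` as a list of `(prob, next_state, reward, done)` tuples); the `ToyEnv` class is a made-up two-state stand-in for the gridworld, used only for illustration:

```python
import numpy as np

class ToyEnv:
    """Hypothetical 2-state MDP mimicking the repo's env interface."""
    def __init__(self):
        self.nS, self.nA = 2, 2
        # P[s][a] = [(prob, next_state, reward, done)]
        self.P = {
            0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
            1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
        }

def policy_eval(policy, env, discount_factor=0.9, theta=1e-8, V_init=None):
    # Warm-start from V_init when given, otherwise fall back to zeros.
    V = np.zeros(env.nS) if V_init is None else np.copy(V_init)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(env, discount_factor=0.9):
    policy = np.ones([env.nS, env.nA]) / env.nA
    V = np.zeros(env.nS)
    while True:
        # Pass the previous value function instead of starting from zeros.
        V = policy_eval(policy, env, discount_factor, V_init=V)
        stable = True
        for s in range(env.nS):
            q = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    q[a] += prob * (reward + discount_factor * V[next_state])
            best = np.argmax(q)
            if np.argmax(policy[s]) != best:
                stable = False
            policy[s] = np.eye(env.nA)[best]
        if stable:
            return policy, V
```

On the toy MDP the optimal policy picks action 1 in both states, giving V(s) = 1/(1 - 0.9) = 10; the warm start only affects how many sweeps each `policy_eval` call needs, not the fixed point it converges to.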
See also related issues about another bug in the function (#203) and its naming (#202).