reinforcement-learning
Provided policy_improvement() solution is not guaranteed to terminate
To set the policy_stable variable, the provided code checks whether the policy has changed. If there are multiple optimal policies, the policy may keep switching between them indefinitely, even though an optimal policy has already been found.
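One possible fix, sketched below, is to treat a state as stable whenever its current action is already among the best actions, instead of requiring the greedy action to be identical. This is only an illustration, not the repository's code: the names `improve_policy` and `is_already_greedy`, the tolerance `tol`, and the array layout (a deterministic policy stored as one action index per state, plus a table of action values) are all assumptions made for the example.

```python
import numpy as np


def is_already_greedy(old_action, action_values, tol=1e-9):
    """Hypothetical helper: True if old_action ties for the best value."""
    return action_values[old_action] >= np.max(action_values) - tol


def improve_policy(policy, q_values, tol=1e-9):
    """One improvement sweep over a deterministic policy (illustrative sketch).

    policy:   int array of shape (n_states,), one action index per state
    q_values: float array of shape (n_states, n_actions)
    Returns (new_policy, policy_stable).
    """
    new_policy = policy.copy()
    policy_stable = True
    for s in range(len(policy)):
        # Only switch when the current action is strictly worse than the best
        # one. Switching between equally good actions is never counted as a
        # change, so ties cannot keep the outer loop alive.
        if not is_already_greedy(policy[s], q_values[s], tol):
            new_policy[s] = int(np.argmax(q_values[s]))
            policy_stable = False
    return new_policy, policy_stable


# State 0 has two equally good actions; state 1 has a single best action.
q = np.array([[1.0, 1.0], [0.0, 2.0]])
policy = np.array([1, 0])
policy, stable = improve_policy(policy, q)   # state 1 is improved
policy, stable = improve_policy(policy, q)   # nothing left to improve
```

With a plain "did the argmax change?" check, state 0 above would be reported as changed just because the tie is broken differently, which is the kind of situation that can stop the loop from ever settling when the evaluated values fluctuate slightly between iterations.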
See Exercise 4.4 in the 2018 edition of the Sutton & Barto book, which explicitly points out this bug in the pseudocode.
Also see the related issue #202 about the naming of the function.