reinforcement-learning
update from upstream & make the implementation more robust and meaningful in DP/Policy Evaluation Solution
- update the description.
- make it more robust. In the loop `for prob, next_state, reward, done in env.P[s][a]:`, we need to first sum the values over every next state before weighting by the action probability, because if there is more than one tuple in `env.P[s][a]`, the current code returns the wrong result. It's insignificant for now, since each `env.P[s][a]` contains only one tuple, whose probability is 1.0 (see the first sketch after this list).
- According to David Silver's slides, for all states s, V_{k+1}(s) should be updated from V_{k}, so I use a new_V to hold the updated values instead of overwriting V during the sweep. Maybe that's more reasonable? (See the second sketch below.)
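A minimal sketch of the backup described in the second bullet, assuming the Gym-style convention used in these exercises where `env.P[s][a]` is a list of `(prob, next_state, reward, done)` tuples and `policy[s]` gives the action probabilities; the `backup` helper name is mine:

```python
def backup(env, V, s, policy, discount_factor=1.0):
    """Expected one-step backup for state s under the given policy.

    Assumes env.P[s][a] is a list of (prob, next_state, reward, done)
    tuples, as in the GridworldEnv used in these exercises.
    """
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        # Sum over every transition tuple for this action first, so
        # stochastic transitions (more than one tuple) are handled
        # correctly, then weight by pi(a|s).
        expected = sum(prob * (reward + discount_factor * V[next_state])
                       for prob, next_state, reward, done in env.P[s][a])
        v += action_prob * expected
    return v
```

Summing the inner generator first makes the expectation over next states explicit, even when each list happens to hold a single tuple with probability 1.0.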
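And a sketch of the synchronous sweep from the third bullet, reusing the `backup` helper above; `env.nS` (number of states) and the `theta` convergence threshold are assumptions from the same exercise setup:

```python
import numpy as np

def policy_eval_sync(policy, env, discount_factor=1.0, theta=1e-5):
    """Iterative policy evaluation with a synchronous update:
    every V_{k+1}(s) is computed from the frozen V_k by writing
    into a separate new_V, as in David Silver's slides."""
    V = np.zeros(env.nS)
    while True:
        new_V = np.zeros(env.nS)              # holds V_{k+1}
        for s in range(env.nS):
            new_V[s] = backup(env, V, s, policy, discount_factor)
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V                             # V_k <- V_{k+1}
```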
Hi, thank you! I need to look more closely at this in a few days.
In terms of point 3, I think both work. Updating immediately (in place) tends to converge faster. The slides don't have this, but the book has a proof for it.
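For comparison, a sketch of the in-place variant described here, under the same assumptions and reusing `numpy` and the hypothetical `backup` helper from the sketches above; later states in a sweep immediately see values updated earlier in the same sweep:

```python
def policy_eval_inplace(policy, env, discount_factor=1.0, theta=1e-5):
    """In-place variant: V[s] is overwritten immediately, so the rest
    of the sweep already uses the updated values. Both variants
    converge to v_pi; this one often needs fewer sweeps."""
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v_new = backup(env, V, s, policy, discount_factor)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # immediate, in-place update
        if delta < theta:
            return V
```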