reinforcement-learning
update from upstream & make the implementation more robust and meaningful in DP/Policy Evaluation Solution
- update the description.
- make it more robust. In the loop `for prob, next_state, reward, done in env.P[s][a]:`, we need to first sum the values over every next state before weighting by the action probability, because if there is more than one tuple in `env.P[s][a]`, the current code returns the wrong result. It's insignificant for now, since each `env.P[s][a]` contains only one tuple, whose probability is 1.0 (see the first sketch after this list).
- According to David Silver's slides, for all states s, V_{k+1}(s) should be updated from V_{k}, so I use a new_V to hold the updated values instead of overwriting V during the sweep. Maybe that's more reasonable? (See the second sketch below.)
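A minimal sketch of the backup described in the second bullet, assuming the Gym-style convention used in these exercises where `env.P[s][a]` is a list of `(prob, next_state, reward, done)` tuples and `policy[s]` gives the action probabilities; the `backup` helper name is mine:

```python
def backup(env, V, s, policy, discount_factor=1.0):
    """Expected one-step backup for state s under the given policy.

    Assumes env.P[s][a] is a list of (prob, next_state, reward, done)
    tuples, as in the GridworldEnv used in these exercises.
    """
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        # Sum over every transition tuple for this action first, so
        # stochastic transitions (more than one tuple) are handled
        # correctly, then weight by pi(a|s).
        expected = sum(prob * (reward + discount_factor * V[next_state])
                       for prob, next_state, reward, done in env.P[s][a])
        v += action_prob * expected
    return v
```

Summing the inner generator first makes the expectation over next states explicit, even when each list happens to hold a single tuple with probability 1.0.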
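And a sketch of the synchronous sweep from the third bullet, reusing the `backup` helper above; `env.nS` (number of states) and the `theta` convergence threshold are assumptions from the same exercise setup:

```python
import numpy as np

def policy_eval_sync(policy, env, discount_factor=1.0, theta=1e-5):
    """Iterative policy evaluation with a synchronous update:
    every V_{k+1}(s) is computed from the frozen V_k by writing
    into a separate new_V, as in David Silver's slides."""
    V = np.zeros(env.nS)
    while True:
        new_V = np.zeros(env.nS)              # holds V_{k+1}
        for s in range(env.nS):
            new_V[s] = backup(env, V, s, policy, discount_factor)
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V                             # V_k <- V_{k+1}
```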
Hi, thank you! I need to look more closely at this in a few days.
In terms of point 3, I think both work. Updating immediately (in place) tends to converge faster. The slides don't have this, but the book has a proof for it.
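For comparison, a sketch of the in-place variant described here, under the same assumptions and reusing `numpy` and the hypothetical `backup` helper from the sketches above; later states in a sweep immediately see values updated earlier in the same sweep:

```python
def policy_eval_inplace(policy, env, discount_factor=1.0, theta=1e-5):
    """In-place variant: V[s] is overwritten immediately, so the rest
    of the sweep already uses the updated values. Both variants
    converge to v_pi; this one often needs fewer sweeps."""
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v_new = backup(env, V, s, policy, discount_factor)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # immediate, in-place update
        if delta < theta:
            return V
```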