
update from upstream & make the implementation in DP/Policy Evaluation Solution more robust and meaningful

Open liu-jc opened this issue 7 years ago • 1 comment

  1. Update the description.
  2. Make it more robust. In the loop `for prob, next_state, reward, done in env.P[s][a]:`, we should sum the contributions from every possible next state. If `env.P[s][a]` contains more than one tuple, the current code returns the wrong result. It happens to work right now only because each list holds a single tuple with probability 1.0.
  3. According to David Silver's slides, for all states s, V_{k+1}(s) should be computed from V_{k}. So I use a `new_V` array to update `V`. Maybe that's more reasonable?
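To make points 2 and 3 concrete, here is a minimal sketch of what I mean (my own illustration, not the repo's exact solution code): it sums over *all* transition tuples in `P[s][a]`, and it builds `V_{k+1}` entirely from `V_k` via a separate `new_V` array. `P[s][a]` is assumed to follow Gym's discrete-env convention of `(prob, next_state, reward, done)` tuples; the toy two-state MDP below is made up for the example.

```python
import numpy as np

def policy_eval(policy, P, n_states, gamma=1.0, theta=1e-8):
    """Synchronous policy evaluation sketch.

    P[s][a] is a list of (prob, next_state, reward, done) tuples,
    as in Gym's discrete environments.  Summing over every tuple
    keeps the code correct even when an action has several possible
    next states, not just one tuple with probability 1.0.
    """
    V = np.zeros(n_states)
    while True:
        new_V = np.zeros(n_states)  # V_{k+1}, built purely from V_k
        for s in range(n_states):
            for a, action_prob in enumerate(policy[s]):
                # Expectation over every possible outcome of (s, a).
                for prob, next_state, reward, done in P[s][a]:
                    new_V[s] += action_prob * prob * (
                        reward + gamma * V[next_state] * (not done))
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V

# Toy MDP where action 0 in state 0 has TWO possible outcomes, so the
# sum over tuples actually matters.  State 1 is terminal.
P = {0: {0: [(0.5, 0, 0.0, False), (0.5, 1, 1.0, True)],
         1: [(1.0, 1, 1.0, True)]},
     1: {0: [(1.0, 1, 0.0, True)],
         1: [(1.0, 1, 0.0, True)]}}
policy = np.full((2, 2), 0.5)  # uniform random policy
V = policy_eval(policy, P, n_states=2)
```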

liu-jc avatar Aug 24 '17 04:08 liu-jc

Hi, thank you! I'll take a closer look at this in a few days.

Regarding 3., I think both work. Updating in place tends to converge faster. The slides don't cover this, but the book has a proof for it.
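The difference can be seen on a toy example (a made-up 3-state chain, not the repo's environment; `P[s][a]` again uses Gym's `(prob, next_state, reward, done)` convention): the in-place rule reuses values already refreshed during the current sweep, while the synchronous rule reads only from a frozen copy of `V_k`, so the in-place version can need fewer sweeps to converge to the same fixed point.

```python
import numpy as np

# Toy chain MDP: 0 -> 1 -> 2, reward -1 per step, state 2 terminal.
P = {0: {0: [(1.0, 1, -1.0, False)]},
     1: {0: [(1.0, 2, -1.0, True)]},
     2: {0: [(1.0, 2, 0.0, True)]}}
policy = {s: {0: 1.0} for s in P}  # deterministic single-action policy
gamma, theta = 1.0, 1e-10

def evaluate(in_place):
    """Run policy evaluation; return (V, number of sweeps)."""
    V = np.zeros(len(P))
    sweeps = 0
    while True:
        sweeps += 1
        # Synchronous rule reads from a frozen snapshot of V_k;
        # in-place rule reads from V itself as it is being updated.
        src = V if in_place else V.copy()
        delta = 0.0
        # Sweep states terminal-first so in-place backups propagate
        # fresh values along the chain within a single sweep.
        for s in sorted(P, reverse=True):
            v = sum(ap * p * (r + gamma * src[ns] * (not d))
                    for a, ap in policy[s].items()
                    for p, ns, r, d in P[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V, sweeps

V_sync, n_sync = evaluate(in_place=False)
V_inpl, n_inpl = evaluate(in_place=True)
```

Both runs converge to V = [-2, -1, 0]; on this chain the in-place sweep finishes in fewer passes, which matches the point that updating immediately tends to converge faster.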

dennybritz avatar Aug 27 '17 01:08 dennybritz