Policy evaluation formulation
In the Policy Evaluation and Policy Iteration Solution.ipynb notebooks, why is the value function updated with the equation below?

v += action_prob * prob * (reward + discount_factor * V[next_state])

Shouldn't it instead be

v += action_prob * (reward + prob * discount_factor * V[next_state])

since the agent receives the reward as soon as it takes the action, and the transition probability should multiply only the value function of the next state?

Correct me if I am wrong.
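For context, here is a minimal sketch of the evaluation loop in question (my own paraphrase, assuming the Gym-style discrete environment used by the gridworld example, where env.P[s][a] is a list of (prob, next_state, reward, done) tuples):

```python
import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=1e-8):
    """Iterative policy evaluation for a finite MDP (sketch).

    policy: array of shape [nS, nA], policy[s][a] = pi(a|s).
    env:    assumed to expose env.nS and env.P[s][a] = [(prob, next_state, reward, done), ...].
    """
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            v = 0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    # Bellman expectation backup: the transition probability
                    # weights the whole term (reward + discounted next value),
                    # which is the line the question asks about.
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V
```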
See page 59 of the textbook (http://incompleteideas.net/book/bookdraft2018jan1.pdf).
Thanks a lot
In David Silver's lectures, policy evaluation is always written with the expected reward outside the sum and the state-transition probability applied only to the next-state values. I got confused by the notation.
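For reference, a sketch of the two conventions being compared (Sutton & Barto's four-argument dynamics versus the notation from Silver's slides, where \mathcal{R}_s^a is the expected immediate reward and \mathcal{P}_{ss'}^a the transition probability):

```latex
% Sutton & Barto: the joint dynamics p(s', r | s, a) weight the whole backup term
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v_\pi(s') \bigr]

% David Silver's lectures: the expected reward sits outside the sum over next states
v_\pi(s) = \sum_a \pi(a \mid s) \Bigl( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_\pi(s') \Bigr)
```

The two are equivalent because \mathcal{R}_s^a = \sum_{s', r} p(s', r \mid s, a) \, r.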
It's actually the same.

\Sum_{next_state, r} prob(next_state, r | s, a) * (reward + discount_factor * V[next_state])
= \Sum_{next_state, r} prob(next_state, r | s, a) * reward + \Sum_{next_state, r} prob(next_state, r | s, a) * discount_factor * V[next_state]

We have \Sum_{next_state, r} prob(next_state, r | s, a) = 1, and in this environment the reward does not depend on which next state occurs, so the first sum reduces to reward (in general it equals the expected immediate reward). The whole expression is therefore equal to

reward + \Sum_{next_state, r} prob(next_state, r | s, a) * discount_factor * V[next_state]
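A quick numerical check of this equivalence, using a made-up transition distribution with a reward that is constant across outcomes (all numbers are hypothetical, purely for illustration):

```python
# Hypothetical one-step backup for a single state-action pair with three
# possible next states; probabilities, values, and rewards are made up.
transitions = [  # (prob, next_state, reward)
    (0.5, 0, -1.0),
    (0.3, 1, -1.0),
    (0.2, 2, -1.0),
]
discount_factor = 0.9
V = [2.0, 5.0, -3.0]  # arbitrary current value estimates

# Notebook form: the transition probability weights the whole backup term.
form_a = sum(prob * (reward + discount_factor * V[next_state])
             for prob, next_state, reward in transitions)

# Proposed form: reward pulled out, probability applied only to V[next_state].
form_b = -1.0 + sum(prob * discount_factor * V[next_state]
                    for prob, next_state, reward in transitions)

print(form_a, form_b)  # both are approximately 0.71
```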