
Policy evaluation formulation

Open · Jaijith opened this issue 7 years ago • 4 comments

In the Policy Evaluation and Policy Iteration Solution.ipynb notebooks, why is the value function calculated with the update below?

`v += action_prob * prob * (reward + discount_factor * V[next_state])`
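For context, here is a rough sketch of the loop that line sits in. The `policy_eval` signature and the `env.P[s][a]` tuples of `(prob, next_state, reward, done)` are my paraphrase of the Gym-style gridworld setup, not copied verbatim from the notebook:

```python
import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=1e-5):
    # Iterative policy evaluation: sweep every state and apply the expected
    # Bellman update until the largest change falls below theta.
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v = 0.0
            # Sum over actions, weighted by pi(a | s) ...
            for a, action_prob in enumerate(policy[s]):
                # ... then over (next_state, reward) outcomes, weighted by their probability.
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V
```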

Shouldn't the value function instead be calculated as

`v += action_prob * (reward + prob * discount_factor * V[next_state])`

since the agent receives the reward as soon as it takes the action, and the transition probability should multiply only the value of the next state?

Correct me if I am wrong

Jaijith · Jan 16 '18 05:01

See page 59 of the textbook: http://incompleteideas.net/book/bookdraft2018jan1.pdf
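The relevant equation is the Bellman expectation equation for v_π in Sutton and Barto's notation:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]
```

Here p(s', r | s, a) is the joint probability of the next state and the reward, so the whole bracketed target, reward included, is weighted by it. That is exactly what the notebook line does.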

chaonan99 · Jan 16 '18 19:01

Thanks a lot


Jaijith · Jan 17 '18 03:01

In David Silver's lectures, policy evaluation is always written with the transition probability applied only to the next-state values and the reward kept outside that sum. I got confused by the difference in notation.
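For reference, the form I had in mind from those slides is roughly:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \Bigl( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, v_\pi(s') \Bigr)
```

where \mathcal{R}_s^a is the expected immediate reward for taking action a in state s, which is why the transition probability only multiplies the next-state values there.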


Jaijith · Jan 17 '18 04:01

It's actually the same.

Writing \gamma for discount_factor and V(s') for V[next_state]:

\sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, V(s')\bigr) = \sum_{s', r} p(s', r \mid s, a)\, r + \gamma \sum_{s', r} p(s', r \mid s, a)\, V(s')

And we have \sum_{s', r} p(s', r \mid s, a) = 1. The first term on the right is the expected immediate reward, which reduces to just r whenever the reward is determined by (s, a), as it is in this gridworld.

So the expression above is equal to

r + \gamma \sum_{s', r} p(s', r \mid s, a)\, V(s'),

which is the form in your question.
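A quick numerical check of that algebra, with made-up numbers and a reward that depends only on (s, a):

```python
import numpy as np

# Hypothetical transitions for a single (s, a) pair; the reward (-1.0) is the
# same for every outcome, as in the gridworld used in the notebook.
transitions = [(0.7, 0, -1.0), (0.3, 1, -1.0)]  # (prob, next_state, reward)
V = np.array([2.0, 5.0])                        # made-up current value estimates
gamma = 0.9

# Form used in the notebook: weight the whole target (r + gamma * V[s']) by p(s', r | s, a).
v_notebook = sum(prob * (reward + gamma * V[next_state])
                 for prob, next_state, reward in transitions)

# Form from the question: reward outside, probability only on the next-state value.
v_question = -1.0 + sum(prob * gamma * V[next_state]
                        for prob, next_state, _ in transitions)

print(v_notebook, v_question)              # both are 1.61 (up to floating-point rounding)
print(np.isclose(v_notebook, v_question))  # True
```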

memoiry · Jan 17 '18 06:01