deep-rl-class
[QUESTION] How did P(τ;θ) disappear when estimating the gradient from trajectory samples?
I am referring to the gradient derivation here.
In the paragraph where the instructor claims "we can approximate the likelihood ratio policy gradient with a sample-based estimate", the term P(τ;θ) (the probability of trajectory τ given the parameters θ) disappears from the subsequent summation. Why?
I asked the same question in the Discord study group (here) but got no response.
Hey there 👋
So P(τ;θ) is the probability of a trajectory, but we can't compute it directly: doing so would require knowing the environment dynamics (the state transition distribution).
If you look at the formulas that follow, what we do is:
- replace P(τ;θ) (impossible to calculate)
- with an average over sampled trajectories, where τ^(i) is the i-th sampled trajectory
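Written out explicitly (this is the standard likelihood-ratio derivation; the exact formulas shown as images in the course notes are assumed here), the replacement is:

```latex
\nabla_\theta J(\theta)
  = \sum_{\tau} P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)
  \;\approx\; \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\,R(\tau^{(i)}),
  \qquad \tau^{(i)} \sim P(\,\cdot\,;\theta).
```

The P(τ;θ) weight is absorbed into the sampling: the left-hand side is an expectation over trajectories, and generating the τ^(i) by running the current policy already makes high-probability trajectories appear more often.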
Don't hesitate to take a piece of paper and write out each part step by step to understand it better. That's how I did it.
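To see numerically why the probability weight vanishes, here is a small sketch (my own illustration, not from the course code). It computes an expectation Σ_τ P(τ) f(τ) two ways: as the explicit weighted sum, and as a plain average over samples drawn from P. The sample average needs no P(τ) factor because the sampling already supplies that weighting.

```python
import random

random.seed(0)

# A toy "trajectory" distribution: 3 trajectories with known probabilities,
# and a return f(tau) for each one.
probs = [0.2, 0.5, 0.3]
returns = [1.0, 4.0, -2.0]

# Exact expectation: the explicit sum weighted by P(tau).
exact = sum(p * r for p, r in zip(probs, returns))

# Monte Carlo estimate: sample trajectories according to P(tau),
# then take a plain average -- no P(tau) factor appears.
m = 200_000
samples = random.choices(returns, weights=probs, k=m)
estimate = sum(samples) / m

print(exact)     # 1.6
print(estimate)  # close to 1.6
```

In real policy-gradient training, "sampling according to P(τ;θ)" simply means rolling out the current policy in the environment, which is why the environment dynamics never need to be known.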
@simoninithomas
I am sorry, but it is still unclear to me. My doubt is: how did we jump from this
to this?
Shouldn't it be as follows:
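The thread ends here without a reply. A likely resolution (my reading, not from the original thread) is that the sum over all trajectories, weighted by P(τ;θ), is an expectation, and Monte Carlo estimation replaces that expectation with an unweighted sample average:

```latex
\sum_{\tau} P(\tau;\theta)\, g(\tau)
  = \mathbb{E}_{\tau \sim P(\,\cdot\,;\theta)}\big[g(\tau)\big]
  \;\approx\; \frac{1}{m}\sum_{i=1}^{m} g(\tau^{(i)}),
  \qquad \tau^{(i)} \sim P(\,\cdot\,;\theta).
```

So P(τ^(i);θ) should not reappear inside the sample sum: the trajectories are already drawn with those probabilities, and that is exactly what replaces the explicit weighting.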