deep-rl-class
[QUESTION] How did P(τ;θ) disappear when estimating the gradient from trajectory samples?
I am referring to the gradient derivation here.
In the paragraph where the instructor claims "we can approximate the likelihood ratio policy gradient with a sample-based estimate", the term P(τ;θ) (the probability of trajectory τ given the parameters θ) disappears from the subsequent summation. Why?
I asked the same question in the Discord study group (here) but got no response.
Hey there 👋
So P(τ;θ) is the probability of a trajectory, but we can't compute it directly: doing so would require knowing the environment dynamics (the state transition distribution).
If you look at the formulas that follow, what we do is:
- replace P(τ;θ) (impossible to calculate)
- with an average over sampled trajectories, where τ^(i) is the i-th sampled trajectory
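Written out explicitly (this is the standard likelihood-ratio derivation; the exact formulas shown as images in the course notes are assumed here), the replacement is:

```latex
\nabla_\theta J(\theta)
  = \sum_{\tau} P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)
  \;\approx\; \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\,R(\tau^{(i)}),
  \qquad \tau^{(i)} \sim P(\,\cdot\,;\theta).
```

The P(τ;θ) weight is absorbed into the sampling: the left-hand side is an expectation over trajectories, and generating the τ^(i) by running the current policy already makes high-probability trajectories appear more often.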
Don't hesitate to take a piece of paper and write out each part step by step to understand it better. That's how I did it.
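To see numerically why the probability weight vanishes, here is a small sketch (my own illustration, not from the course code). It computes an expectation Σ_τ P(τ) f(τ) two ways: as the explicit weighted sum, and as a plain average over samples drawn from P. The sample average needs no P(τ) factor because the sampling already supplies that weighting.

```python
import random

random.seed(0)

# A toy "trajectory" distribution: 3 trajectories with known probabilities,
# and a return f(tau) for each one.
probs = [0.2, 0.5, 0.3]
returns = [1.0, 4.0, -2.0]

# Exact expectation: the explicit sum weighted by P(tau).
exact = sum(p * r for p, r in zip(probs, returns))

# Monte Carlo estimate: sample trajectories according to P(tau),
# then take a plain average -- no P(tau) factor appears.
m = 200_000
samples = random.choices(returns, weights=probs, k=m)
estimate = sum(samples) / m

print(exact)     # 1.6
print(estimate)  # close to 1.6
```

In real policy-gradient training, "sampling according to P(τ;θ)" simply means rolling out the current policy in the environment, which is why the environment dynamics never need to be known.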
@simoninithomas
I am sorry, but it is still unclear to me. My doubt is: how did we jump from this
to this?
Shouldn't it be as follows:
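The thread ends here without a reply. A likely resolution (my reading, not from the original thread) is that the sum over all trajectories, weighted by P(τ;θ), is an expectation, and Monte Carlo estimation replaces that expectation with an unweighted sample average:

```latex
\sum_{\tau} P(\tau;\theta)\, g(\tau)
  = \mathbb{E}_{\tau \sim P(\,\cdot\,;\theta)}\big[g(\tau)\big]
  \;\approx\; \frac{1}{m}\sum_{i=1}^{m} g(\tau^{(i)}),
  \qquad \tau^{(i)} \sim P(\,\cdot\,;\theta).
```

So P(τ^(i);θ) should not reappear inside the sample sum: the trajectories are already drawn with those probabilities, and that is exactly what replaces the explicit weighting.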