[QUESTION] How did P(τ;θ) disappear when estimating the gradient using trajectory samples?

Open ritwikmishra opened this issue 4 months ago • 2 comments

I am referring to the gradient derivation here.

In the paragraph where the instructor claims "we can approximate the likelihood ratio policy gradient with sample-based estimate", the term P(τ;θ) (the probability of trajectory τ given the parameters θ) disappears from the subsequent summation. Why?

I asked the same question in the Discord study group (here) but got no response.

ritwikmishra, Feb 29 '24 15:02

Hey there 👋 image

So P(τ;θ) is the probability of a trajectory, but we can't compute it, since that would require knowing the environment dynamics (the state distribution).

If you look at the formulas that follow, what we do is:

  • Replace P(τ;θ) (impossible to calculate)
  • With image where τ^(i) is a sampled trajectory
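
One way to read this replacement is as a Monte Carlo estimate: the weight P(τ;θ) is not dropped, it is accounted for by drawing the trajectories from P(τ;θ) itself. The formula images are not reproduced here, so the notation below (J(θ) for the objective, R(τ) for the return of a trajectory, m for the number of sampled trajectories) is assumed from the standard likelihood ratio derivation rather than copied from the course:

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim P(\cdot;\theta)}
     \big[ \nabla_\theta \log P(\tau;\theta)\, R(\tau) \big] \\
  &\approx \frac{1}{m} \sum_{i=1}^{m}
     \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)}),
  \qquad \tau^{(i)} \sim P(\cdot;\theta)
\end{align}
```

Under this reading, a likely trajectory shows up more often among the m samples, which is how its probability enters the average. Multiplying each sampled term by P(τ^(i);θ) again would count that probability twice.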

Don't hesitate to take a piece of paper and write out each part step by step to understand it better. It's how I did it.
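
A quick numerical check can complement the pen-and-paper route. The sketch below is a toy stand-in, not the course's code: three hypothetical "trajectories" with made-up probabilities `p` and a made-up per-trajectory value `f` (playing the role of the gradient-times-return term). It compares the exact weighted sum, the plain sample average, and a sample average that keeps the probability as an extra factor.

```python
import numpy as np


def estimates(m=100_000, seed=0):
    """Compare three ways of computing sum_x p(x) * f(x) on a toy problem.

    All numbers here are illustrative assumptions, not values from the course.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical probabilities of three "trajectories" (they sum to 1).
    p = np.array([0.5, 0.3, 0.2])
    # Hypothetical per-trajectory quantity (stand-in for grad-log-prob * return).
    f = np.array([1.0, 4.0, 10.0])

    # Exact expectation: needs the full distribution p.
    exact = float(np.sum(p * f))

    # Draw m trajectory indices according to p.
    idx = rng.choice(len(p), size=m, p=p)

    # Sample-based estimate: plain average of f, with no explicit p factor.
    monte_carlo = float(f[idx].mean())

    # Keeping p inside the sample average weights by p twice.
    double_weighted = float((p[idx] * f[idx]).mean())

    return exact, monte_carlo, double_weighted


if __name__ == "__main__":
    exact, monte_carlo, double_weighted = estimates()
    print(f"exact sum          : {exact:.3f}")
    print(f"sample average     : {monte_carlo:.3f}")
    print(f"p kept in average  : {double_weighted:.3f}")
```

In this toy setting the plain sample average should land close to the exact sum, while the version that keeps `p` inside the average converges to the sum of `p**2 * f` instead, which is a different quantity.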

simoninithomas, Mar 05 '24 09:03

@simoninithomas I am sorry, but it is still unclear to me. My doubt is: how did we jump from this image

to this image

Shouldn't it be as follows:

image

ritwikmishra, Mar 08 '24 06:03