cpo-pytorch icon indicating copy to clipboard operation
cpo-pytorch copied to clipboard

mean kl is always=0

Open xzhang2523 opened this issue 2 years ago • 5 comments
trafficstars

hi, I notice that in your code, mean_kl always=0 constraint_grad = flat_grad(constraint_loss, self.policy.parameters(), retain_graph=True) # (b)

    mean_kl = mean_kl_first_fixed(actions_dists, actions_dists)
    Fvp_fun = get_Hvp_fun(mean_kl, self.policy.parameters())

what is the meaning of a gradient of a constant?

xzhang2523 avatar Feb 21 '23 14:02 xzhang2523

How do you know that it is always 0? I have not looked at this code in a long time, but I'll do my best to help.

ajlangley avatar Feb 22 '23 17:02 ajlangley

image

in line 164, KL of two same distribution is 0.

and the imp-sampling is always 1.

Best, Xiaoyuan

xzhang2523 avatar Feb 23 '23 03:02 xzhang2523

The first line you highlighted is not a bug. If you notice, the second term is detached, so that we are taking the gradient of the importance sampling ratio w.r.t. the new policy. The second line you boxed is indeed a bug. One of those terms should be detached (whichever one represents the old policy). I believe the first input should be detached, the and the second one should require a gradient.

ajlangley avatar Feb 23 '23 04:02 ajlangley

Dear aj,

Actually, I have print the value of "imp_sampling" while debugging. It is actually, always 1.

xzhang2523 avatar Feb 23 '23 13:02 xzhang2523

Right, but does it have `require_grad=True'?

The intention is that it will always be 1 when you print it out, because what is really happening is that we are just setting up a symbolic expression involving the old policy, which we just used to collect the data, and the new policy, which we are going to find via gradient descent, and we are using the old policy as a starting point.

What I mean is, we are really setting up a function J(x) and taking the gradient at x = old_policy.

Maybe you can try printing out the sampling ratio between the old policy and the new policy after updating. If they are still 1, that is indeed a bug.

Does this make sense?

ajlangley avatar Feb 23 '23 15:02 ajlangley