higgsfield
Gaussian distribution of the policy
I am just learning how to implement PPO for continuous action spaces, and this repo has been a godsend except for one point. If we assume that the output of the actor is mu, the mean of the Gaussian policy, then I can see how an action sampled from the old policy's distribution would always have a lower probability density under the new policy than under the old one. Thus, the ratio of the new policy's probability to the old policy's probability would always be between 0 and 1, so we could omit the clip at 1 + epsilon; i.e., clipping at 1 - epsilon alone would work in all cases. Am I on the right path here? I know this is not Stack Overflow, but nobody is answering me over there...
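
In case it helps pin down what I mean, here is a minimal sketch of the ratio computation as I understand it, assuming a diagonal Gaussian policy in PyTorch. The values of `mu_old`, `mu_new`, `sigma`, and `advantage` are placeholders of my own, not from the repo:

```python
import torch
from torch.distributions import Normal

# Placeholder means for the old and new policy at the same state;
# sigma is held fixed here for simplicity.
mu_old = torch.tensor([0.0])
mu_new = torch.tensor([0.2])
sigma = torch.tensor([1.0])

old_policy = Normal(mu_old, sigma)
new_policy = Normal(mu_new, sigma)

# As in PPO, the action is sampled from the OLD policy during rollout.
action = old_policy.sample()

# PPO works with log-densities, so the ratio is exp(log_new - log_old).
ratio = torch.exp(new_policy.log_prob(action) - old_policy.log_prob(action))

# Clipped surrogate objective with epsilon = 0.2, as in the PPO paper.
epsilon = 0.2
advantage = torch.tensor([1.0])  # placeholder advantage estimate
surrogate = torch.min(
    ratio * advantage,
    torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage,
)
print(ratio.item(), surrogate.item())
```

Running this repeatedly is how I have been checking whether the ratio really stays below 1 in practice.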