ElegantRL
A policy update bug in AgentPPO?
The following code shows that the policy used to explore the env (i.e. to generate the action and logprob) is 'self.act':
get_action = self.act.get_action
convert = self.act.convert_action_for_env
for i in range(horizon_len):
    state = torch.as_tensor(ary_state, dtype=torch.float32, device=self.device)
    action, logprob = [t.squeeze() for t in get_action(state.unsqueeze(0))]
while in the update function, the policy used to calculate 'new_logprob' is exactly the same 'self.act', applied to the same states and actions as above:
new_logprob, obj_entropy = self.act.get_logprob_entropy(state, action)
ratio = (new_logprob - logprob.detach()).exp()
I think that 'ratio' will always be 1. Is this a bug, or is there something I misunderstand?
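For reference, here is a minimal, self-contained sketch (not ElegantRL code; the toy network, placeholder advantages, and hyperparameters are made up for illustration) of how the PPO ratio is usually intended to behave: the old log-probability is recorded once at collection time and treated as a constant, while the new log-probability is recomputed from the current policy parameters at every update step, so the ratio equals 1 only before the first gradient step of the update phase.

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)  # hypothetical toy policy: state -> action mean
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(8, 4)   # small batch of rollout states (made up)
advantage = torch.randn(8)  # placeholder advantages, for illustration only

# Rollout: sample actions and store the behavior policy's log-probabilities.
with torch.no_grad():
    dist = torch.distributions.Normal(policy(state), 1.0)
    action = dist.sample()
    old_logprob = dist.log_prob(action).sum(dim=1)  # fixed, carries no gradient

# Update epochs on the same batch: new_logprob is recomputed each step,
# so the ratio drifts away from 1 after the first optimizer step.
for _ in range(4):
    dist = torch.distributions.Normal(policy(state), 1.0)
    new_logprob = dist.log_prob(action).sum(dim=1)
    ratio = (new_logprob - old_logprob).exp()
    surrogate = torch.min(ratio * advantage,
                          ratio.clamp(0.8, 1.2) * advantage)
    loss = -surrogate.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(ratio.detach().mean().item())  # 1.0 on the first pass, then drifts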
The following pull request fixes this bug ↓ Fix bug for vec env and agentbase init #248
https://github.com/AI4Finance-Foundation/ElegantRL/pull/248