ElegantRL
A policy update bug in AgentPPO?
The following code shows that the policy used to explore the env (i.e. to generate the action and logprob) is 'self.act':
get_action = self.act.get_action
convert = self.act.convert_action_for_env
for i in range(horizon_len):
    state = torch.as_tensor(ary_state, dtype=torch.float32, device=self.device)
    action, logprob = [t.squeeze() for t in get_action(state.unsqueeze(0))]
while in the update function, the policy used to calculate 'new_logprob' is exactly the same 'self.act', applied to the same states and actions as above:
new_logprob, obj_entropy = self.act.get_logprob_entropy(state, action)
ratio = (new_logprob - logprob.detach()).exp()
I think that 'ratio' will always be 1. Is this a bug, or is there something I misunderstand?
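For reference, here is a minimal, self-contained sketch (not ElegantRL code; the toy network, placeholder advantages, and hyperparameters are made up for illustration) of how the PPO ratio is usually intended to behave: the old log-probability is recorded once at collection time and treated as a constant, while the new log-probability is recomputed from the current policy parameters at every update step, so the ratio equals 1 only before the first gradient step of the update phase.

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)  # hypothetical toy policy: state -> action mean
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(8, 4)   # small batch of rollout states (made up)
advantage = torch.randn(8)  # placeholder advantages, for illustration only

# Rollout: sample actions and store the behavior policy's log-probabilities.
with torch.no_grad():
    dist = torch.distributions.Normal(policy(state), 1.0)
    action = dist.sample()
    old_logprob = dist.log_prob(action).sum(dim=1)  # fixed, carries no gradient

# Update epochs on the same batch: new_logprob is recomputed each step,
# so the ratio drifts away from 1 after the first optimizer step.
for _ in range(4):
    dist = torch.distributions.Normal(policy(state), 1.0)
    new_logprob = dist.log_prob(action).sum(dim=1)
    ratio = (new_logprob - old_logprob).exp()
    surrogate = torch.min(ratio * advantage,
                          ratio.clamp(0.8, 1.2) * advantage)
    loss = -surrogate.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(ratio.detach().mean().item())  # 1.0 on the first pass, then drifts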
The following pull request fixes this bug ↓ Fix bug for vec env and agentbase init #248
https://github.com/AI4Finance-Foundation/ElegantRL/pull/248