
A question about the code of the 'ActorSAC' class in net.py

Open · legao-2 opened this issue 2 years ago · 1 comment

I'm confused about why we use 'logprob = dist.log_prob(a_avg)' instead of 'logprob = dist.log_prob(action)' in line 247 of elegantrl/agents/net.py. I think the latter is consistent with the original paper. Does using the former work better in experiments?

def get_action_logprob(self, state):
    state = self.state_norm(state)
    s_enc = self.net_s(state)  # encoded state
    a_avg, a_std_log = self.net_a(s_enc).chunk(2, dim=1)
    a_std = a_std_log.clamp(-16, 2).exp()
    dist = Normal(a_avg, a_std)
    action = dist.rsample()
    action_tanh = action.tanh()
    logprob = dist.log_prob(a_avg)
    logprob -= (-action_tanh.pow(2) + 1.000001).log()  # fix logprob using the derivative of action.tanh()
    return action_tanh, logprob.sum(1)
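
For comparison, here is a minimal sketch of the formulation in the original SAC paper, which evaluates log_prob at the sampled pre-tanh action and then applies the tanh change-of-variables correction. The function name and arguments are hypothetical and not part of ElegantRL:

import torch
from torch.distributions.normal import Normal

def get_action_logprob_paper(a_avg, a_std):
    # a_avg, a_std: mean and std of the pre-tanh Gaussian policy (hypothetical inputs)
    dist = Normal(a_avg, a_std)
    action = dist.rsample()          # reparameterized sample u ~ N(a_avg, a_std)
    action_tanh = action.tanh()      # squashed action a = tanh(u)
    logprob = dist.log_prob(action)  # log N(u | a_avg, a_std), evaluated at the sample
    # correction for the tanh squashing: log p(a) = log p(u) - log(1 - tanh(u)^2)
    logprob -= (1.000001 - action_tanh.pow(2)).log()
    return action_tanh, logprob.sum(1)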

legao-2 · Mar 18 '23

Yes, using the former is better.

You can read the webpage below for more information.

Update tanh bijector with numerically stable formula. https://github.com/tensorflow/probability/commit/ef6bb176e0ebd1cf6e25c6b5cecdd2428c22963f#diff-e120f70e92e6741bca649f04fcd907b7

    def log_abs_det_jacobian(self, x, y):
        # We use a formula that is more numerically stable, see details in the following link
        # https://github.com/tensorflow/probability/commit/ef6bb176e0ebd1cf6e25c6b5cecdd2428c22963f#diff-e120f70e92e6741bca649f04fcd907b7
        return 2. * (math.log(2.) - x - F.softplus(-2. * x))
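
The identity behind that line is log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)). The right-hand side avoids evaluating 1 - tanh(x)^2 directly, which underflows to zero (and its log to -inf) for large |x|. A small standalone sketch to compare the two forms, not taken from either codebase:

import math
import torch
import torch.nn.functional as F

x = torch.tensor([0.0, 1.0, 5.0, 20.0], dtype=torch.float64)

naive = torch.log(1.0 - torch.tanh(x).pow(2))               # underflows: -inf at x = 20
stable = 2.0 * (math.log(2.0) - x - F.softplus(-2.0 * x))   # stays finite for all x

print(naive)
print(stable)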


See also the related comment in the ElegantRL code: https://github.com/AI4Finance-Foundation/ElegantRL/blob/dee9c6d095001bf8365c0359f0d04a021d8c1e22/elegantrl/agents/net.py#L344-L359

Yonv1943 · Mar 27 '23