[bug] Adaptive SAC: using logarithm of entropy coefficient to compute temperature objective instead of entropy coefficient

Open Mattia-sony opened this issue 1 year ago • 1 comments

In the paper, equation (18), the entropy coefficient $\alpha$ is used directly in the temperature objective, while the sb3 implementation uses its logarithm (here). As a result, the coefficient applied in the critic and actor objectives can differ by orders of magnitude from the quantity actually optimized in the temperature objective ($J(\alpha)$ in the paper's notation).

We might want to change the line I referenced to: `ent_coef_loss = -(th.exp(self.log_ent_coef) * (log_prob + self.target_entropy).detach()).mean()`

Or is there a reason for using the logarithm here?
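
To make the comparison concrete, here is a minimal pure-Python sketch (no sb3 or PyTorch involved) that evaluates both forms of the loss on made-up numbers and estimates their gradients with respect to `log_ent_coef` by finite differences. The names `sb3_style_loss`, `paper_style_loss`, and `finite_diff_grad` and the sample values are my own assumptions for illustration, not sb3 code. Under these assumptions the two gradients have the same sign and differ by a factor equal to the current entropy coefficient, so the two forms should push `log_ent_coef` in the same direction but with different step sizes.

```python
import math

def sb3_style_loss(log_ent_coef, detached_terms):
    # Form referenced in the issue: the logarithm multiplies the detached term.
    return -sum(log_ent_coef * c for c in detached_terms) / len(detached_terms)

def paper_style_loss(log_ent_coef, detached_terms):
    # Proposed form: exp(log_ent_coef), i.e. the coefficient itself, multiplies it.
    ent_coef = math.exp(log_ent_coef)
    return -sum(ent_coef * c for c in detached_terms) / len(detached_terms)

def finite_diff_grad(loss_fn, log_ent_coef, detached_terms, eps=1e-6):
    # Central difference with respect to log_ent_coef (hypothetical helper).
    upper = loss_fn(log_ent_coef + eps, detached_terms)
    lower = loss_fn(log_ent_coef - eps, detached_terms)
    return (upper - lower) / (2 * eps)

# Made-up values standing in for (log_prob + target_entropy), treated as constants.
detached_terms = [-0.7, 0.2, -1.1, -0.4]
ent_coef = 0.01
log_ent_coef = math.log(ent_coef)

g_log = finite_diff_grad(sb3_style_loss, log_ent_coef, detached_terms)
g_exp = finite_diff_grad(paper_style_loss, log_ent_coef, detached_terms)
ratio = g_exp / g_log  # expected to be close to ent_coef

print(f"gradient, log form: {g_log:.6f}")
print(f"gradient, exp form: {g_exp:.6f}")
print(f"ratio exp/log:      {ratio:.6f}")
```

If this sketch is representative, the practical difference would be the effective learning rate of the temperature (scaled by the coefficient itself in the paper form), rather than the direction of the update.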

Mattia-sony, Sep 24 '24 14:09