stable-baselines3
[bug] Adaptive SAC: using logarithm of entropy coefficient to compute temperature objective instead of entropy coefficient
In the paper, equation (18), the entropy coefficient $\alpha$ is used directly in the temperature objective, whereas the SB3 implementation uses its logarithm (here). As a result, the coefficient that weights the entropy term in the critic and actor objectives can differ by orders of magnitude from the value that enters the temperature objective ($J(\alpha)$ in the paper's notation).
We might want to change the line I referenced into:
ent_coef_loss = -(th.exp(self.log_ent_coef) * (log_prob + self.target_entropy).detach()).mean()
Or is there a reason for using the logarithm here?
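To make the difference concrete, here is a minimal sketch comparing the gradient of both objectives with respect to `log_ent_coef`. It is plain Python rather than PyTorch, it uses a finite difference instead of autograd, and the log-probabilities, target entropy, and starting coefficient are made-up values chosen only for illustration. The function names are hypothetical and not part of SB3.

```python
import math

def log_loss(log_alpha, log_probs, target_entropy):
    """Objective that multiplies by log(alpha), as in the line referenced above."""
    terms = [log_alpha * (lp + target_entropy) for lp in log_probs]
    return -sum(terms) / len(terms)

def alpha_loss(log_alpha, log_probs, target_entropy):
    """Proposed objective that multiplies by alpha = exp(log_alpha)."""
    alpha = math.exp(log_alpha)
    terms = [alpha * (lp + target_entropy) for lp in log_probs]
    return -sum(terms) / len(terms)

def grad(loss_fn, log_alpha, log_probs, target_entropy, eps=1e-6):
    """Central finite difference of loss_fn with respect to log_alpha."""
    upper = loss_fn(log_alpha + eps, log_probs, target_entropy)
    lower = loss_fn(log_alpha - eps, log_probs, target_entropy)
    return (upper - lower) / (2 * eps)

# Made-up batch of action log-probabilities and a made-up target entropy.
log_probs = [-1.2, -0.7, -2.1, -0.4]
target_entropy = -1.0
log_alpha = math.log(0.01)  # assumed current entropy coefficient of 0.01

g_log = grad(log_loss, log_alpha, log_probs, target_entropy)
g_alpha = grad(alpha_loss, log_alpha, log_probs, target_entropy)
ratio = g_alpha / g_log

print(f"gradient with log(alpha): {g_log:.6f}")
print(f"gradient with alpha:      {g_alpha:.6f}")
print(f"ratio:                    {ratio:.6f}")
```

Under these assumptions, both gradients have the same sign, and when differentiating with respect to `log_ent_coef` they differ by a factor equal to the current entropy coefficient. So the two objectives push the coefficient in the same direction but with different step sizes.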