[bug] Adaptive SAC: using logarithm of entropy coefficient to compute temperature objective instead of entropy coefficient

Open Mattia-sony opened this issue 1 year ago • 1 comments

In the paper, equation (18), the entropy coefficient is used directly, while in the sb3 implementation its logarithm is used (here). As a result, the temperature coefficient used in the critic and actor objectives can differ by orders of magnitude from the quantity used to adjust its value in the temperature objective ($J(\alpha)$ in the paper's notation).
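For reference, the two objectives being compared can be written side by side. This is a sketch in the paper's notation, assuming $\bar{\mathcal{H}}$ is the target entropy (`self.target_entropy` in the code) and $\log \pi_t(a_t \mid s_t)$ corresponds to `log_prob`; the label $J_{\log}$ is introduced here for the sb3 form and is not paper notation.

$$
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right]
$$

$$
J_{\log}(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\log \alpha \left(\log \pi_t(a_t \mid s_t) + \bar{\mathcal{H}}\right)\right]
$$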

We might want to change the line I referenced to: `ent_coef_loss = -(th.exp(self.log_ent_coef) * (log_prob + self.target_entropy).detach()).mean()`

Or is there a reason for using the logarithm here?
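One way to look at this question is through the gradient with respect to `log_ent_coef`, the parameter that is actually optimized. The sketch below is plain Python rather than PyTorch and uses a finite-difference gradient; the function names, the batch of log-probabilities, and the target entropy are all made up for illustration and are not taken from sb3.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def loss_log_form(log_alpha, log_probs, target_entropy):
    # Form the issue attributes to sb3: multiply by log(alpha) directly.
    return -mean([log_alpha * (lp + target_entropy) for lp in log_probs])


def loss_alpha_form(log_alpha, log_probs, target_entropy):
    # Form proposed in this issue: multiply by alpha = exp(log_alpha).
    alpha = math.exp(log_alpha)
    return -mean([alpha * (lp + target_entropy) for lp in log_probs])


def numerical_grad(loss_fn, log_alpha, log_probs, target_entropy, eps=1e-6):
    # Central finite difference with respect to log_alpha.
    up = loss_fn(log_alpha + eps, log_probs, target_entropy)
    down = loss_fn(log_alpha - eps, log_probs, target_entropy)
    return (up - down) / (2 * eps)


# Hypothetical values, chosen only to make the comparison concrete.
log_probs = [-1.2, -0.4, -2.0]
target_entropy = -1.0
log_alpha = math.log(0.01)

g_log = numerical_grad(loss_log_form, log_alpha, log_probs, target_entropy)
g_alpha = numerical_grad(loss_alpha_form, log_alpha, log_probs, target_entropy)

# Both gradients share a sign; their ratio equals alpha.
print(g_log, g_alpha, g_alpha / g_log)
```

In this toy setting the two losses move `log_ent_coef` in the same direction, and the gradient of the exponentiated form is the gradient of the log form scaled by $\alpha$. So, at least in this sketch, the difference shows up as an $\alpha$-dependent effective step size on `log_ent_coef` rather than as a different update direction.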

Sep 24 '24 14:09 Mattia-sony

Duplicate of https://github.com/DLR-RM/stable-baselines3/issues/36, https://github.com/DLR-RM/stable-baselines3/issues/802, and https://github.com/DLR-RM/stable-baselines3/issues/712

PS: could you do a PR that adds a note about that in our doc?

Oct 02 '24 08:10 araffin