
[Question] why SAC train method updates the target every steps?

Open angel-ayala opened this issue 5 months ago • 2 comments

❓ Question

Hi, I was looking into the code inside the SAC algorithm, and I encountered an odd line here.

When checking whether to update the target network, shouldn't it use self._n_updates (or something similar) instead of gradient_step? With the default hyperparameters that variable is always zero, so the condition is always entered.

Is there maybe a specific reason to do it this way?

Thanks!
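A minimal standalone sketch of the pattern being questioned (simplified for illustration, not the actual SB3 code): the target-update check uses the inner loop index `gradient_step`, so when `gradient_steps=1` the index is always 0 and the condition fires on every call, whatever the update interval is.

```python
def count_target_updates(gradient_steps: int, target_update_interval: int) -> int:
    """Count how many times the target network would be updated in one train() call,
    assuming the check uses the inner loop index as in the line under discussion."""
    updates = 0
    for gradient_step in range(gradient_steps):
        # ... critic/actor losses would be computed here ...
        if gradient_step % target_update_interval == 0:
            updates += 1  # polyak_update(...) would run here
    return updates

# With the default gradient_steps=1, the loop index is always 0,
# so the target is updated on every call regardless of the interval:
assert count_target_updates(1, 10) == 1
# With gradient_steps > 1, the interval actually takes effect:
assert count_target_updates(10, 5) == 2
```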


angel-ayala avatar Jul 23 '25 02:07 angel-ayala

Hello, I think that was to mimic DQN and have some kind of delayed update (but this is not used by default). gradient_step is not always zero (it depends on the hyperparameters), but a better way would be to have a policy delay as in the TD3 or SBX SAC implementations: https://github.com/DLR-RM/stable-baselines3/blob/2dce430161087c46b9e668f56f3c94c634eebff8/stable_baselines3/td3/td3.py#L197 and https://github.com/araffin/sbx/blob/849e90805c992b1b6b143105b5927e3eb7db2472/sbx/sac/sac.py#L443

In short, you can ignore that line; it should probably either be removed or the code updated to implement a proper policy delay.
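A hedged sketch of what the suggested TD3-style policy delay looks like: the check uses a counter that persists across train() calls rather than the inner loop index. The names `_n_updates` and `policy_delay` follow the SB3/TD3 convention, but this is a simplified standalone illustration, not the library code.

```python
class DelayedUpdater:
    """Toy model of a TD3-style delayed actor/target update."""

    def __init__(self, policy_delay: int = 2):
        self.policy_delay = policy_delay
        self._n_updates = 0  # persists across train() calls, unlike a loop index
        self.actor_updates = 0

    def train(self, gradient_steps: int) -> None:
        for _ in range(gradient_steps):
            self._n_updates += 1
            # the critic would be updated on every gradient step here
            if self._n_updates % self.policy_delay == 0:
                # delayed actor update and target-network polyak update go here
                self.actor_updates += 1

updater = DelayedUpdater(policy_delay=2)
# Even with gradient_steps=1 per call, the delay is respected across calls:
for _ in range(4):
    updater.train(gradient_steps=1)
assert updater._n_updates == 4
assert updater.actor_updates == 2
```

Because the counter is an attribute rather than a loop variable, the delay behaves the same whether the gradient steps happen in one train() call or spread over many.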

EDIT: the commit that introduced this line: https://github.com/DLR-RM/stable-baselines3/commit/322399e8fefc9dceed5a13bb389ef374168e32c5

araffin avatar Jul 23 '25 14:07 araffin

That's what I was thinking. In TD3 it is well implemented, since it correctly considers the number of updates to apply the policy delay; I'm not sure whether self._n_updates is coherent with the number of steps. I mentioned that the variable gradient_step will always be 0 in the cases where gradient_steps is 1.

I will try to update this behavior and create a PR if you agree.

angel-ayala avatar Jul 23 '25 15:07 angel-ayala