[Question] Why does SAC's train method update the target networks every step?
❓ Question
Hi, I was looking into the code of the SAC algorithm, and I encountered an odd line here.
When deciding whether to update the target networks, shouldn't it use self._n_updates, or something similar, instead of gradient_step? That variable is always zero, so the condition is always entered.
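To make this concrete, here is a stripped-down, runnable mock of the loop structure as I read it (only the names gradient_step, gradient_steps and target_update_interval follow the SB3 code; everything else is simplified):

```python
# Mock of the loop in SAC.train(): gradient_step is the loop index,
# so it restarts at 0 on every call to train().
target_update_interval = 1

def train(gradient_steps: int) -> None:
    for gradient_step in range(gradient_steps):
        # ... critic/actor gradient updates would happen here ...
        if gradient_step % target_update_interval == 0:
            print(f"gradient_step={gradient_step}: target networks updated")

# With gradient_steps=1 (the default), gradient_step is 0 on every call,
# so the condition is always true:
for _ in range(3):
    train(gradient_steps=1)
```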
Is there maybe a specific reason to do it this way?
Thanks!
Checklist
- [x] I have checked that there is no similar issue in the repo
- [x] I have read the documentation
- [ ] If code there is, it is minimal and working
- [ ] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
Hello, I think that was to mimic DQN and have some kind of delayed update (but this is not used by default). gradient_step is not always zero (it depends on the hyperparameters), but a better way would be to have a policy delay like in the TD3 or SBX SAC implementation: https://github.com/DLR-RM/stable-baselines3/blob/2dce430161087c46b9e668f56f3c94c634eebff8/stable_baselines3/td3/td3.py#L197 and https://github.com/araffin/sbx/blob/849e90805c992b1b6b143105b5927e3eb7db2472/sbx/sac/sac.py#L443
In short, you can ignore that line; it should probably be removed, or the code should be updated to implement a proper policy delay.
EDIT: the commit that introduced this line: https://github.com/DLR-RM/stable-baselines3/commit/322399e8fefc9dceed5a13bb389ef374168e32c5
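To sketch what I mean by a proper policy delay (a schematic only, not the actual TD3/SBX code; n_updates stands in for self._n_updates):

```python
# Sketch of the TD3-style fix: key the delayed update on a counter that
# persists across calls to train(), instead of the per-call loop index.
n_updates = 0
target_update_interval = 2

def train(gradient_steps: int) -> None:
    global n_updates
    for _ in range(gradient_steps):
        n_updates += 1
        # ... gradient step on critic (and actor) would go here ...
        if n_updates % target_update_interval == 0:
            print(f"n_updates={n_updates}: polyak update of the target networks")

# Four calls with gradient_steps=1: the targets now update every second
# call instead of on every call.
for _ in range(4):
    train(gradient_steps=1)
```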
That's what I was thinking. In TD3 this is well implemented, since the policy delay correctly considers the number of updates; I'm not sure whether self._n_updates stays coherent with the number of steps in SAC.
What I meant is that the loop variable gradient_step will always be 0 in those cases where the gradient_steps hyperparameter is 1.
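From my reading of the two train methods, the difference is where the counter is advanced; here is a runnable schematic of the two patterns (paraphrased, not the exact SB3 code):

```python
gradient_steps, policy_delay, target_update_interval = 4, 2, 2

# TD3-style: the persistent counter is incremented inside the loop, so
# the modulo check sees the true number of updates performed so far.
n_updates = 0
for _ in range(gradient_steps):
    n_updates += 1
    if n_updates % policy_delay == 0:
        print(f"TD3-style: delayed update at n_updates={n_updates}")

# SAC-style: only the per-call loop index is available inside the loop;
# the counter is bumped in bulk afterwards, so it lags during the loop.
n_updates = 0
for gradient_step in range(gradient_steps):
    if gradient_step % target_update_interval == 0:
        print(f"SAC-style: target update at gradient_step={gradient_step}")
n_updates += gradient_steps
```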
I will try to update this behavior and create a PR if you agree.