[question] SAC target q nets may be updated many times in each round?
This line of code: https://github.com/hill-a/stable-baselines/blob/d288b6d911094268b17abb61dbe9ac3ca05e1ce3/stable_baselines/sac/sac.py#L464
Since `step` is in `range(total_timesteps)` and `grad_step` is nested in `range(self.gradient_steps)`, then if `self.gradient_steps` is not 1, the target networks are updated repeatedly?
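A simplified sketch of how I read the loop structure around the linked line (not the exact code; the condition is only an approximation of the linked check):

```python
for step in range(total_timesteps):
    # ... collect one transition from the environment ...
    for grad_step in range(self.gradient_steps):
        # ... one gradient update of the critics and the policy ...
        # the target network update sits inside the inner loop:
        if (step + grad_step) % self.target_update_interval == 0:
            self.sess.run(self.target_update_op)
```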
I have not reviewed the SAC internals recently, but if the goal is to update the target network every `target_update_interval` updates to the main network, then multiple updates per `step` are to be expected. However, there is a related mistake here: with `gradient_steps > 1`, the timing of the updates gets screwed up. See the example below: the zeros should be equally far apart from each other.
@araffin Can you confirm and check this?
```python
for i in range(10):
    for j in range(3):
        print((i + j) % 4)
```

Output:

```
0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
```
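For comparison, a sketch of one possible fix: counting gradient updates with a single running counter gives equally spaced zeros (illustrative only, not code from the repository):

```python
n_updates = 0
for i in range(10):
    for j in range(3):
        print(n_updates % 4)  # prints 0, 1, 2, 3, 0, 1, 2, 3, ...
        n_updates += 1
```

so every fourth gradient update would trigger a target update.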
> Can you confirm and check this?
In short: yes, it is a bug, but luckily `target_update_interval=1`, so it won't change any results :sweat_smile:.
Looking at the official implementation, there is apparently something wrong there as well:
https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L294
https://github.com/rail-berkeley/softlearning/blob/daf076b5042625704f63373b145509f716e67819/softlearning/algorithms/rl_algorithm.py#L345
They check `iteration % self._target_update_interval == 0`, but `iteration` is the same for several gradient steps...
@hartikainen could you comment on that?
We may also need to fix that in SB3; the implementation has the same behavior as the original one and followed what is done in TD3: the target networks are updated after each gradient step.
Hey @araffin. The iteration is intentionally constant for all the gradient steps in the softlearning code. Basically, the logic currently checks if we should train on this `iteration` and, if so, applies `self._n_train_repeat` gradient steps. However, I think the way these values are configured in softlearning is really confusing, so you all might want to think of a better way to configure these 🙂 For SAC, it doesn't really matter what the exact gradient step interval is; that is, it's pretty much the same to apply `x/y` gradient steps after each collected sample or `Nx` gradient steps after every `Ny` samples, as long as the ratio between gradient steps and environment samples stays the same and the gradient interval is somewhat reasonable (e.g. sampling a full 1000-step episode should work for sure).
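A rough illustration of that equivalence (the function and numbers below are made up for the example; they are not softlearning's configuration keys):

```python
# Number of gradient steps performed after total_env_steps environment steps,
# when training every `train_interval` env steps with `steps_per_train`
# gradient steps per training call.
def gradient_steps_done(total_env_steps, train_interval, steps_per_train):
    return (total_env_steps // train_interval) * steps_per_train

# 1 gradient step after every env step vs. 1000 gradient steps after every
# 1000-step episode: same total amount of training.
assert gradient_steps_done(10_000, 1, 1) == gradient_steps_done(10_000, 1000, 1000)
```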
Thanks @hartikainen for your answer.
> to apply x/y gradient steps after each collected sample or Nx gradient steps after every Ny samples
yes, I have also experienced that. My question was more about the target network update.
In DQN, it is independent of the gradient update. In TD3, there is one soft update per gradient step. For SAC, it is the same as TD3 for now, but the current code in softlearning seems to be closer to DQN... (currently it does not change anything though because `target_update_interval=1`).
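To make the distinction concrete, a schematic comparison (plain NumPy arrays stand in for network parameters; `hard_update` / `soft_update` are placeholder helpers, not API from either library):

```python
import numpy as np

def hard_update(target, main):
    target[:] = main                              # copy the parameters wholesale

def soft_update(target, main, tau=0.005):
    target[:] = (1 - tau) * target + tau * main   # Polyak averaging

main, target = np.ones(3), np.zeros(3)

# DQN-style: the update is tied to the environment-step counter and is
# independent of how many gradient steps were taken.
for num_timesteps in range(1, 11):
    if num_timesteps % 4 == 0:                    # e.g. target_update_interval = 4
        hard_update(target, main)

# TD3/SAC-style (current SB behavior): one soft update after every gradient step.
for grad_step in range(10):
    soft_update(target, main)
```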
Ah, I see, yeah. The target update, in my opinion, should be dependent on the gradient updates, so it's probably preferable to do what TD3 does.