[question] SAC target q nets may be updated many times in each round?
This line of code: https://github.com/hill-a/stable-baselines/blob/d288b6d911094268b17abb61dbe9ac3ca05e1ce3/stable_baselines/sac/sac.py#L464
Since `step` is in `range(total_timesteps)` and `grad_step` is nested in `range(self.gradient_steps)`, then if `self.gradient_steps` is not 1, the target networks are updated repeatedly?
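A simplified sketch of how I read the loop structure around the linked line (not the exact code; the condition is only an approximation of the linked check):

```python
for step in range(total_timesteps):
    # ... collect one transition from the environment ...
    for grad_step in range(self.gradient_steps):
        # ... one gradient update of the critics and the policy ...
        # the target network update sits inside the inner loop:
        if (step + grad_step) % self.target_update_interval == 0:
            self.sess.run(self.target_update_op)
```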
I have not reviewed the SAC internals recently, but if the goal is to update the target network every `target_update_interval` updates to the main network, then multiple updates per `step` are to be expected. However, there is a related mistake here: with `gradient_steps > 1`, the timing of the updates gets screwed up. See the example below: the zeros should be equally far apart from each other.
@araffin Can you confirm and check this?
```python
for i in range(10):
    for j in range(3):
        print((i + j) % 4)
```

Output:

```
0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
```
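For comparison, a sketch of one possible fix: counting gradient updates with a single running counter gives equally spaced zeros (illustrative only, not code from the repository):

```python
n_updates = 0
for i in range(10):
    for j in range(3):
        print(n_updates % 4)  # prints 0, 1, 2, 3, 0, 1, 2, 3, ...
        n_updates += 1
```

so every fourth gradient update would trigger a target update.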
> Can you confirm and check this?
In short: yes, it is a bug, but luckily `target_update_interval=1`, so it won't change any results :sweat_smile:.
Looking at the official implementation, there is apparently something wrong there as well:
https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L294
https://github.com/rail-berkeley/softlearning/blob/daf076b5042625704f63373b145509f716e67819/softlearning/algorithms/rl_algorithm.py#L345
They check `iteration % self._target_update_interval == 0`, but `iteration` is the same for several gradient steps...
@hartikainen could you comment on that?
We may also need to fix that in SB3; the implementation has the same behavior as the original one and followed what is done in TD3: the target networks are updated after each gradient step.
Hey @araffin. The iteration is intentionally constant for all the gradient steps in the softlearning code. Basically, the logic currently checks if we should train on this `iteration` and, if so, applies `self._n_train_repeat` gradient steps. However, I think the way these values are configured in softlearning is really confusing, so you all might want to think of a better way to configure these 🙂 For SAC, it doesn't really matter what the exact gradient step interval is; that is, it's pretty much the same to apply `x/y` gradient steps after each collected sample or `Nx` gradient steps after every `Ny` samples, as long as the ratio between gradient steps and environment samples stays the same and the gradient interval is somewhat reasonable (e.g. sampling a full 1000-step episode should work for sure).
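A rough illustration of that equivalence (the function and numbers below are made up for the example; they are not softlearning's configuration keys):

```python
# Number of gradient steps performed after total_env_steps environment steps,
# when training every `train_interval` env steps with `steps_per_train`
# gradient steps per training call.
def gradient_steps_done(total_env_steps, train_interval, steps_per_train):
    return (total_env_steps // train_interval) * steps_per_train

# 1 gradient step after every env step vs. 1000 gradient steps after every
# 1000-step episode: same total amount of training.
assert gradient_steps_done(10_000, 1, 1) == gradient_steps_done(10_000, 1000, 1000)
```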
Thanks @hartikainen for your answer.
> to apply x/y gradient steps after each collected sample or Nx gradient steps after every Ny samples
yes, I have also experienced that. My question was more about the target network update.
In DQN, it is independent of the gradient update. In TD3, there is one soft update per gradient step. For SAC, it is the same as TD3 for now, but the current code in softlearning seems to be closer to DQN... (currently it does not change anything though because `target_update_interval=1`).
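To make the distinction concrete, a schematic comparison (plain NumPy arrays stand in for network parameters; `hard_update` / `soft_update` are placeholder helpers, not API from either library):

```python
import numpy as np

def hard_update(target, main):
    target[:] = main                              # copy the parameters wholesale

def soft_update(target, main, tau=0.005):
    target[:] = (1 - tau) * target + tau * main   # Polyak averaging

main, target = np.ones(3), np.zeros(3)

# DQN-style: the update is tied to the environment-step counter and is
# independent of how many gradient steps were taken.
for num_timesteps in range(1, 11):
    if num_timesteps % 4 == 0:                    # e.g. target_update_interval = 4
        hard_update(target, main)

# TD3/SAC-style (current SB behavior): one soft update after every gradient step.
for grad_step in range(10):
    soft_update(target, main)
```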
Ah, I see, yeah. The target update, in my opinion, should be dependent on the gradient updates, so it's probably preferable to do what TD3 does.