
[Bug] Local variable 'values' not updated in the callback for the last timestep

Open rajcscw opened this issue 3 years ago • 6 comments

🐛 Bug

In the rollout collection of PPO, the callback's reference to the local variable values is not updated after the value for the last timestep has been predicted. https://github.com/DLR-RM/stable-baselines3/blob/ed308a71be24036744b5ad4af61b083e4fbdf83c/stable_baselines3/common/on_policy_algorithm.py#L210

Due to this, at the end of the rollout when callback.on_rollout_end() is called, locals["values"] may not match the actual final value of the local variable values.

This might lead to undesirable effects for users relying on the last values in the callback.
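For instance, a minimal callback along these lines (hypothetical, just to illustrate the issue, assuming it is passed to model.learn(..., callback=...)) would see the stale tensor at rollout end:

from stable_baselines3.common.callbacks import BaseCallback

class InspectLastValuesCallback(BaseCallback):
    # Hypothetical callback: reads the "values" local at the end of the rollout.
    # Without a fix, this tensor corresponds to the last obs of the collection
    # loop, not to the value predicted for new_obs.

    def _on_step(self) -> bool:
        return True

    def _on_rollout_end(self) -> None:
        stale_values = self.locals["values"]
        print("values seen by the callback:", stale_values.flatten())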

One simple fix is to update the locals after the self.policy.predict_values(..) call by simply calling callback.update_locals(locals()) again, as shown below. Further, it would also be better not to overwrite values but to use a new variable, last_values or next_values, to make clear that it refers to the predicted values corresponding to new_obs, whereas values previously corresponded to obs (just to keep it in sync with the other variables rewards, actions, infos, etc.).

with th.no_grad():
    # Compute value for the last timestep
    next_values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))

# Update callback's locals
callback.update_locals(locals())

rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

callback.on_rollout_end()
  

Expected behavior

Local variables inside the callback function are in sync with the actual local variables.

System Info

OS: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-debian-buster-sid #1 SMP Wed Oct 13 17:20:51 UTC 2021
Python: 3.7.11
Stable-Baselines3: 1.4.0
PyTorch: 1.10.2+cu102
GPU Enabled: True
Numpy: 1.21.5
Gym: 0.19.0

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have read the documentation (required)
  • [x] I have provided a minimal working example to reproduce the bug (required)

rajcscw avatar Apr 15 '22 00:04 rajcscw

Hey! Thanks for raising this. Next time remember to include minimal code to reproduce things ;)

I agree with the fix, although I would not change the variable names, in case somebody already used these variables in the past (also, the last_values=values line should be updated). @araffin any callback-specific comments on this?

Miffyli avatar Apr 15 '22 09:04 Miffyli

Hello,

This might lead to undesirable effects for users relying on the last values in the callback.

I'm curious to know more about your use-case. A callback may not be the appropriate way to do this.

Otherwise, yes calling callback.update_locals(locals()) should solve the issue.

araffin avatar Apr 15 '22 09:04 araffin

@Miffyli @araffin Thanks for getting back. Just to be clear: we can introduce a new variable next_values for the line self.policy.predict_values(obs_as_tensor(new_obs, self.device)) and then update the locals in the callback. That way values is not changed, and users who used it in the past are not affected.

Regarding my use-case: once the rollout is completed, I have to re-adjust the rewards in the rollout buffer and re-compute the advantages and returns (for this, I need last_dones and last_values, which I can grab from callback.locals). The only way I can do this without modifying SB3 code is to include this logic inside callback._on_rollout_end(), right? Or is there another way?
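A rough sketch of the callback I have in mind (class and variable names are illustrative, and it assumes the locals are updated before on_rollout_end() as proposed above):

from stable_baselines3.common.callbacks import BaseCallback

class RewardAdjustmentCallback(BaseCallback):
    # Illustrative sketch: rewrite the rewards stored in the rollout buffer
    # and recompute returns/advantages once the rollout has been collected.

    def _on_step(self) -> bool:
        return True

    def _on_rollout_end(self) -> None:
        buffer = self.locals["rollout_buffer"]
        # Placeholder for the actual adjustment, e.g. a bonus computed from
        # self.model.policy and the collected observations.
        buffer.rewards = buffer.rewards * 1.0
        # Recompute returns and advantages with the values/dones of the last step.
        # Without the proposed fix, self.locals["values"] is stale at this point.
        buffer.compute_returns_and_advantage(
            last_values=self.locals["values"],
            dones=self.locals["dones"],
        )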

rajcscw avatar Apr 15 '22 11:04 rajcscw

That's what I thought; you should fork SB3 for your use-case, otherwise you will compute the advantage twice and end up with hacks and side effects here and there if you use callbacks. Or, if you just need to adjust rewards, you can use a gym wrapper.
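A minimal sketch of such a reward wrapper (class name and scaling are made up for illustration; it can be applied per-env before wrapping into the vec env):

import gym

class ScaledRewardWrapper(gym.RewardWrapper):
    # Illustrative wrapper: rescales the reward returned by the wrapped env.
    def __init__(self, env, scale: float = 0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return self.scale * reward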

araffin avatar Apr 15 '22 12:04 araffin

True, I agree using callbacks is kind of hacky. However, I think that is the only option for me if I do not want to fork SB3, because I work on custom envs that are compatible with SB3 out of the box (so I can plug in different algorithms easily). Also, there is a cyclic dependency between the env (reward) and the policy, since the policy is required to adjust the rewards. Since I use SubProcEnv for parallel rollouts, it is not efficient to send the policy to the env (reward function) frequently due to the communication overhead.

Therefore, a callback is the perfect place: I have access to the policy and the env and can adjust the rewards, advantages, etc.

So while working on this, I noticed that the values were out of sync (the problem I mentioned above). It would be sufficient to just call callback.update_locals(locals()) before callback.on_rollout_end(). What do you think, @Miffyli @araffin? I can open a PR for this.

rajcscw avatar Apr 15 '22 16:04 rajcscw

If you use a callback to access both the model and the env, this is the same as forking, but more hacky...

Since I use SubProcEnv for parallel rollouts, it is not efficient to send the policy to env (reward function) frequently due to communication overhead.

You could do that in the collect_rollout() then.

What do you think, @Miffyli @araffin? I can open a PR for this.

Please do =) (should be a one-line change, please don't forget the changelog)

araffin avatar Apr 17 '22 14:04 araffin