🐛 [Bug] Hindsight experience replay (HER) is not updating the done flag of HER transitions
When sampling HER transitions, the SB3 implementation calculates a new reward but not a new done flag.
The idea of HER is to sample successful goal transitions by replacing the desired goal with goals that were actually achieved later in the episode.
This means that we need to calculate a new reward for each new HER transition.
This is done in https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/her/her_replay_buffer.py#L354-L366.
However, we also need to calculate a new `done` flag, since a successful goal transition should also result in a finished episode.
This could be achieved by adding

```python
transitions["done"][her_indices, 0] = self.env.env_method(
    "compute_done",
    transitions["next_achieved_goal"][her_indices, 0],
    transitions["desired_goal"][her_indices, 0],
    transitions["info"][her_indices, 0],
)
```
just after the re-computation of the reward.
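As an illustration, a minimal sketch of such a `compute_done()` on a goal-based env could look like this (this is a hypothetical method, not part of the gym `GoalEnv` API; the class name and distance threshold are placeholders):

```python
import numpy as np


class ReachGoalEnv:
    """Sketch only: a real env would subclass gym.GoalEnv and implement reset()/step()."""

    distance_threshold = 0.05  # assumed success threshold

    def compute_done(self, achieved_goal, desired_goal, info):
        # Hypothetical counterpart to compute_reward(): a relabelled transition whose
        # achieved goal matches the new desired goal should also end the (virtual) episode.
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return (distance <= self.distance_threshold).astype(np.float32)
```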
Why is this critical?
Most RL algorithms use this `done` flag to update the Q-values.
E.g. SAC in https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/sac/sac.py#L237
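Schematically, the `done` flag masks the bootstrap term of the target (a simplified sketch, not the actual SB3 code; names and the entropy term are condensed):

```python
def td_target(rewards, dones, next_q_values, next_log_prob, gamma=0.99, ent_coef=0.2):
    # Simplified SAC-style backup: the (1 - done) mask cuts off bootstrapping at
    # terminal states. If HER never sets done=1 for its relabelled "goal reached"
    # transitions, the target keeps bootstrapping past a state that should be
    # terminal, which biases the learned Q-values.
    return rewards + (1.0 - dones) * gamma * (next_q_values - ent_coef * next_log_prob)
```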
Experimental results
I temporarily fixed the issue with a monkey patch to compare the training results with and without the updated done flag. My setup is a robot manipulator similar to https://github.com/qgallouedec/panda-gym/, but with dynamic obstacles in the scene, so an episode can also end by colliding with an obstacle. I'm using SAC + HER for training the agent.
It would be a bit much to share my entire environment, so I am just uploading the results with and without the done flag update. If you want me to test this change on the panda-gym environment, I could also try to get that running.
BEFORE bug fix
AFTER bug fix
Checklist
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have read the documentation (required)
- [ ] I have provided a minimal working example to reproduce the bug (required)
Hello,
> the SB3 implementation calculates a new reward but not a new done flag.
I agree, but

> This could be achieved by adding

this is unfortunately not a method that is part of the gym specification, and it does not account for the done due to timeout (but that should be fine).
> It would be a bit much to share my entire environment, so I am just uploading the results with and without the done flag update. If you want me to test this change on the panda-gym environment, I could also try to get that running.
What you should do instead is to use the `info` dict to store whether there was a collision or not (unless you can deduce that from the achieved goal only) and change the reward accordingly in `compute_reward()`, no?
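Something along these lines, as a rough sketch (the `collision` key, threshold, and penalty are just placeholders, not actual panda-gym code):

```python
import numpy as np


class ObstacleReachEnv:
    """Rough sketch, not panda-gym code; step() is assumed to set info["collision"]."""

    distance_threshold = 0.05  # assumed success threshold

    def compute_reward(self, achieved_goal, desired_goal, info):
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        reward = -(distance > self.distance_threshold).astype(np.float32)
        # The info dict recorded at step() time is stored in the replay buffer,
        # so HER's relabelled transitions keep the collision penalty.
        if isinstance(info, dict) and info.get("collision", False):
            reward -= 10.0  # assumed penalty value
        return reward
```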
Hi again araffin,
> What you should do instead is to use the `info` dict to store whether there was a collision or not (unless you can deduce that from the achieved goal only) and change the reward accordingly in `compute_reward()`, no?
In my opinion, it is a flaw that the `GoalEnv` in gym does not have a `compute_done()` function.
However, I guess it would be possible to let `compute_reward()` do that for us.
Then the question is whether `transitions["info"][her_indices, 0]` is a mutable object.
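To illustrate what I mean, a rough sketch (not existing gym/SB3 code; thresholds and key names are placeholders):

```python
import numpy as np


def compute_reward(achieved_goal, desired_goal, info):
    # Rough sketch: besides returning the reward, compute_reward() writes the
    # recomputed done flag into each mutable info dict, so the HER buffer can
    # read it back after relabelling.
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    success = np.atleast_1d(distance <= 0.05)  # assumed success threshold
    infos = [info] if isinstance(info, dict) else list(info)
    for reached, single_info in zip(success, infos):
        single_info["done"] = bool(reached) or single_info.get("collision", False)
    return -(1.0 - success.astype(np.float32))
```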
After changing the info entries, we would need to change https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/her/her_replay_buffer.py#L385-L391 to
```python
return DictReplayBufferSamples(
    observations=normalized_obs,
    actions=self.to_torch(transitions["action"]),
    next_observations=next_obs,
    # read back the done flag that compute_reward() wrote into each info dict
    dones=self.to_torch(
        np.array([info["done"] for info in transitions["info"][:, 0]], dtype=np.float32).reshape(-1, 1)
    ),
    rewards=self.to_torch(self._normalize_reward(transitions["reward"], maybe_vec_env)),
)
```
right?
This could be a way of solving the issue without having to add a new function to all environments. However, most environments do not change the `done` information in the `compute_reward()` function, and doing so is quite counterintuitive.
I would prefer to go with a more functional design and have a specific function for calculating the done flag.
I also want to highlight that this problem should be independent of whether collisions are possible in your env or not. The calculation of the Q-values is still based on the done flag, which indicates a reached goal as well as a collision.