🐛 [Bug] Hindsight experience replay (HER) is not updating the done flag of HER transitions
When sampling HER transitions, the SB3 implementation calculates a new reward but not a new done flag.
The idea of HER is to sample successful goal transitions by replacing the desired goal with goals that were actually achieved later in the episode.
This means that we need to calculate a new reward for each new HER transition.
This is done in https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/her/her_replay_buffer.py#L354-L366.
However, we also need to calculate a new `done` flag, since a successful goal transition should also result in a finished episode.
This could be achieved by adding

```python
transitions["done"][her_indices, 0] = self.env.env_method(
    "compute_done",
    transitions["next_achieved_goal"][her_indices, 0],
    transitions["desired_goal"][her_indices, 0],
    transitions["info"][her_indices, 0],
)
```
just after the re-computation of the reward.
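As an illustration, a minimal sketch of such a `compute_done()` on a goal-based env could look like this (this is a hypothetical method, not part of the gym `GoalEnv` API; the class name and distance threshold are placeholders):

```python
import numpy as np


class ReachGoalEnv:
    """Sketch only: a real env would subclass gym.GoalEnv and implement reset()/step()."""

    distance_threshold = 0.05  # assumed success threshold

    def compute_done(self, achieved_goal, desired_goal, info):
        # Hypothetical counterpart to compute_reward(): a relabelled transition whose
        # achieved goal matches the new desired goal should also end the (virtual) episode.
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return (distance <= self.distance_threshold).astype(np.float32)
```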
Why is this critical?
Most RL algorithms use this `done` flag to update the Q-values.
E.g. SAC in https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/sac/sac.py#L237
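Schematically, the `done` flag masks the bootstrap term of the target (a simplified sketch, not the actual SB3 code; names and the entropy term are condensed):

```python
def td_target(rewards, dones, next_q_values, next_log_prob, gamma=0.99, ent_coef=0.2):
    # Simplified SAC-style backup: the (1 - done) mask cuts off bootstrapping at
    # terminal states. If HER never sets done=1 for its relabelled "goal reached"
    # transitions, the target keeps bootstrapping past a state that should be
    # terminal, which biases the learned Q-values.
    return rewards + (1.0 - dones) * gamma * (next_q_values - ent_coef * next_log_prob)
```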
Experimental results
I temporarily fixed the issue with a monkey patch to compare the training results with and without the updated done flag. My setup is a robot manipulator similar to https://github.com/qgallouedec/panda-gym/, but with dynamic obstacles in the scene, so an episode can also end by colliding with an obstacle. I'm using SAC + HER for training the agent.
It would be a bit much to share my entire environment, so I am just uploading the results with and without the done flag update. If you want me to test this change on the panda-gym environment, I could also try to get that running.
BEFORE bug fix
AFTER bug fix
Checklist
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have read the documentation (required)
- [ ] I have provided a minimal working example to reproduce the bug (required)
Hello,
> the SB3 implementation calculates a new reward but not a new done flag.
I agree, but

> This could be achieved by adding

this is unfortunately not a method that is part of the gym specification, and it does not account for the done due to timeout (but that should be fine).
> It would be a bit much to share my entire environment, so I am just uploading the results with and without the done flag update. If you want me to test this change on the panda-gym environment, I could also try to get that running.
What you should do instead is to use the `info` dict to store whether there was a collision or not (unless you can deduce that from the achieved goal only) and change the reward accordingly in `compute_reward()`, no?
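Something along these lines, as a rough sketch (the `collision` key, threshold, and penalty are just placeholders, not actual panda-gym code):

```python
import numpy as np


class ObstacleReachEnv:
    """Rough sketch, not panda-gym code; step() is assumed to set info["collision"]."""

    distance_threshold = 0.05  # assumed success threshold

    def compute_reward(self, achieved_goal, desired_goal, info):
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        reward = -(distance > self.distance_threshold).astype(np.float32)
        # The info dict recorded at step() time is stored in the replay buffer,
        # so HER's relabelled transitions keep the collision penalty.
        if isinstance(info, dict) and info.get("collision", False):
            reward -= 10.0  # assumed penalty value
        return reward
```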
Hi again araffin,
> What you should do instead is to use the `info` dict to store whether there was a collision or not (unless you can deduce that from the achieved goal only) and change the reward accordingly in `compute_reward()`, no?
In my opinion, it is a flaw that the `GoalEnv` in gym does not have a `compute_done()` function.
However, I guess it would be possible to let `compute_reward()` do that for us.
Then the question is whether `transitions["info"][her_indices, 0]` is a mutable object.
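To illustrate what I mean, a rough sketch (not existing gym/SB3 code; thresholds and key names are placeholders):

```python
import numpy as np


def compute_reward(achieved_goal, desired_goal, info):
    # Rough sketch: besides returning the reward, compute_reward() writes the
    # recomputed done flag into each mutable info dict, so the HER buffer can
    # read it back after relabelling.
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    success = np.atleast_1d(distance <= 0.05)  # assumed success threshold
    infos = [info] if isinstance(info, dict) else list(info)
    for reached, single_info in zip(success, infos):
        single_info["done"] = bool(reached) or single_info.get("collision", False)
    return -(1.0 - success.astype(np.float32))
```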
After changing the info entries, we would need to change https://github.com/DLR-RM/stable-baselines3/blob/7b977d7b0344f2828c5af863814fc82868a8af2f/stable_baselines3/her/her_replay_buffer.py#L385-L391 to
```python
return DictReplayBufferSamples(
    observations=normalized_obs,
    actions=self.to_torch(transitions["action"]),
    next_observations=next_obs,
    # read back the done flag that compute_reward() wrote into each info dict
    dones=self.to_torch(
        np.array([info["done"] for info in transitions["info"][:, 0]], dtype=np.float32).reshape(-1, 1)
    ),
    rewards=self.to_torch(self._normalize_reward(transitions["reward"], maybe_vec_env)),
)
```
right?
This could be a way of solving the issue without having to add a new function to all environments. However, most environments do not change the `done` information in the `compute_reward()` function, and doing so is quite counterintuitive.
I would prefer to go with a more functional design and have a specific function for calculating the done flag.
I also want to highlight that this problem should be independent of whether collisions are possible in your env or not. The calculation of the Q-values is still based on the done flag, which indicates a reached goal as well as a collision.