
[Question] HER applied on GoalEnv with ObservationWrapper

Open ritalaezza opened this issue 2 years ago • 5 comments

Question

I am using stable-baselines3's implementation of HER with a custom environment, but I ran into problems in the reward computation step. The Gym environment is based on GoalEnv, but the observations consist of nested Dict spaces, which stable-baselines3 does not support. To overcome this limitation, I flatten the contents of 'observation', 'desired_goal' and 'achieved_goal' into a single Box space, using an ObservationWrapper. This works fine when collecting data, but when the new goals (based on n_sampled_goal) are sampled, the compute_reward() function does not receive the 'desired_goal' and 'achieved_goal' in the correct Dict format, and gets a batch of flattened arrays instead.

Is there a way to overcome this problem?

Additional context

The custom Gym environment delegates its compute_reward() function to an abstract reward class, so that reward functions can easily be exchanged.
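For illustration, a rough sketch of this pattern (all class, attribute and method names here are hypothetical, not the actual environment code):

from abc import ABC, abstractmethod

import numpy as np


class RewardFunction(ABC):  # hypothetical abstract reward class
    @abstractmethod
    def __call__(self, achieved_goal, desired_goal, info):
        """Compute the reward from the original nested Dict goals."""


class NegativeDistanceReward(RewardFunction):  # hypothetical concrete reward
    def __call__(self, achieved_goal, desired_goal, info):
        # Placeholder logic: assumes both goals contain a "position" entry.
        return -float(np.linalg.norm(achieved_goal["position"] - desired_goal["position"]))


# The environment then simply delegates:
# def compute_reward(self, achieved_goal, desired_goal, info):
#     return self.reward_fn(achieved_goal, desired_goal, info)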

Checklist

  • [x] I have read the documentation (required)
  • [x] I have checked that there is no similar issue in the repo (required)

ritalaezza avatar Aug 17 '22 12:08 ritalaezza

Hello, could you elaborate a little bit (and follow the custom env issue template by providing a minimal code example to reproduce the issue)? Did you check your env using the env checker?

in the correct Dict format, and gets a batch of flattened arrays instead.

The signature of compute_reward() is env.compute_reward(obs["achieved_goal"], obs["desired_goal"], info), and we require compute_reward() to be vectorized (i.e. it must accept a batch of goals as input).
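For illustration, a rough sketch of a compute_reward() that meets this requirement (placeholder distance-based reward, not the library's code), working for a single goal as well as for a batch of goals:

import numpy as np

def compute_reward(self, achieved_goal, desired_goal, info):
    # achieved_goal / desired_goal may be a single goal of shape (goal_dim,)
    # or a batch of shape (batch_size, goal_dim) relabelled by HER.
    achieved_goal = np.asarray(achieved_goal)
    desired_goal = np.asarray(desired_goal)
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    # Sparse reward with an arbitrary placeholder threshold.
    return -(distance > 0.05).astype(np.float32)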

araffin avatar Aug 17 '22 13:08 araffin

compute_reward() function does not receive the 'desired_goal' and 'achieved_goal' in the correct Dict format, and gets a batch of flattened arrays instead.

It seems pretty obvious since you

flatten the contents of 'observation', 'desired_goal' and 'achieved_goal' into a single Box space

For HER to work, the observation space must be

spaces.Dict(
    {
        "observation": ...,
        "desired_goal": ...,
        "achieved_goal": ...
    }
)

qgallouedec avatar Aug 17 '22 14:08 qgallouedec

Hello,

Thank you for the speedy reply. I guess you just answered my question. Since stable-baselines3 requires compute_reward() to be vectorized (taking a batch of goals as input), this is not the case for my environment, and I'll have to work around this.

Currently the reward function processes the contents of the original nested Dict structures of 'achieved_goal' and 'desired_goal', not the flattened versions being passed to the learning code. This is the expected behavior of an ObservationWrapper:

Gym Documentation: "If you would like to apply a function to the observation that is returned by the base environment before passing it to learning code, you can simply inherit from ObservationWrapper and overwrite the method observation to implement that transformation."

In my case, the observation space of the wrapped environment is the following:

spaces.Dict(
    {
        "observation": spaces.Box(),
        "desired_goal": spaces.Box(),
        "achieved_goal": spaces.Box(),
    }
)
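For reference, a wrapper producing such a space could look roughly like the sketch below; the wrapper name is hypothetical and the flattening relies on gym.spaces.flatten / gym.spaces.flatten_space:

import gym
from gym import spaces


class FlattenGoalObservation(gym.ObservationWrapper):  # hypothetical wrapper name
    KEYS = ("observation", "desired_goal", "achieved_goal")

    def __init__(self, env):
        super().__init__(env)
        self._nested_space = env.observation_space  # original nested Dict space
        self.observation_space = spaces.Dict(
            {key: spaces.flatten_space(self._nested_space.spaces[key]) for key in self.KEYS}
        )

    def observation(self, obs):
        # Flatten each nested Dict entry into a 1D Box before passing it to the learning code.
        return {
            key: spaces.flatten(self._nested_space.spaces[key], obs[key])
            for key in self.KEYS
        }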

This allows the reward computation to work for the original nested Dict observation and goal spaces during normal stepping of the environment when collecting experience. What was not obvious to me was that your HER implementation requires the compute_reward() function to work both for a single sample of 'achieved_goal' and 'desired_goal', as well as for a batch of them. Even more problematic is that this batch contains the flattened goals instead of the original Dict goals.

🤖 Custom Gym Environment

I hadn't before, but I have tried it now, and the env checker does not return any errors when I provide it with the wrapped environment.

### Describe the bug

(see above)

### Code example

Unfortunately, the environment requires a proprietary back-end simulator, so to provide you with a working example I would have to create a dummy version that results in the same problem.

### System Info

The output of sb3.get_system_info() is the following:

OS: Linux-5.4.0-90-generic-x86_64-with-glibc2.29 #101~18.04.1-Ubuntu
Python: 3.8.10
Stable-Baselines3: 1.6.0
PyTorch: 1.12.0+cu102
GPU Enabled: False
Numpy: 1.23.1
Gym: 0.21.0

I am actually running in a Docker container with Ubuntu 20.04, though.

### Checklist

  • [x] I have read the documentation (required)
  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have checked my env using the env checker (required)
  • [ ] I have provided a minimal working example to reproduce the bug (required)

ritalaezza avatar Aug 18 '22 08:08 ritalaezza

this is not the case for my environment, and I'll have to work around this.

The easiest option is to do a for loop (even though it will be slow), as suggested in https://github.com/DLR-RM/stable-baselines3/issues/854. We require vectorization in order to have a fast implementation.
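A rough sketch of that workaround, assuming the environment has a scalar compute_single_reward() helper (hypothetical name):

import numpy as np

def compute_reward(self, achieved_goal, desired_goal, info):
    achieved_goal = np.atleast_2d(achieved_goal)
    desired_goal = np.atleast_2d(desired_goal)
    # Loop over the batch; slow, but compatible with HER's batched relabelling.
    rewards = np.array(
        [
            self.compute_single_reward(a, d, info)  # hypothetical scalar helper
            for a, d in zip(achieved_goal, desired_goal)
        ],
        dtype=np.float32,
    )
    return rewards if rewards.shape[0] > 1 else rewards[0]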

What was not obvious to me was that your HER implementation requires the compute_reward() function to work both for a single sample of 'achieved_goal' and 'desired_goal', as well as for a batch of them.

Yes, I thought it was documented, but it is not... we welcome a PR that updates the doc ;) (and the env checker)

Even more problematic is that this batch contains the flattened goals instead of the original Dict goals.

hmm, but this is expected, no?

I would have to create a dummy version that results in the same problem.

Yes, that's usually what we mean by "minimal code example".

araffin avatar Aug 18 '22 09:08 araffin

We require vectorization in order to have a fast implementation.

I will see if I can change the reward function so that it can be vectorized.

hmm, but this is expected, no?

Well, I guess so, given that when compute_reward() receives a batch of goals, they come from stable-baselines3, which only sees the flattened arrays, not the original nested Dicts. I will have to reverse the flattening operation for each sample in the batch to compute the vector of rewards.
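A rough sketch of that un-flattening step, assuming the environment keeps a reference to the original nested goal space (the attribute and reward-object names below are hypothetical) and relying on gym.spaces.unflatten:

import numpy as np
from gym import spaces

def compute_reward(self, achieved_goal, desired_goal, info):
    achieved_goal = np.atleast_2d(achieved_goal)
    desired_goal = np.atleast_2d(desired_goal)
    nested_goal_space = self.nested_goal_space  # hypothetical attribute: original nested Dict goal space
    rewards = []
    for a_flat, d_flat in zip(achieved_goal, desired_goal):
        # Reverse the ObservationWrapper's flattening for each sample in the batch.
        a_nested = spaces.unflatten(nested_goal_space, a_flat)
        d_nested = spaces.unflatten(nested_goal_space, d_flat)
        rewards.append(self.reward_fn(a_nested, d_nested, info))  # hypothetical reward object
    rewards = np.asarray(rewards, dtype=np.float32)
    return rewards if rewards.shape[0] > 1 else rewards[0]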

Yes, that's usually what we mean by "minimal code example".

Maybe once I get my implementation to work, I can make such a code example and even a PR.

ritalaezza avatar Aug 19 '22 06:08 ritalaezza

It is actually documented in #780 (and the env checker is updated there).

We should probably cherry-pick those changes.

araffin avatar Oct 02 '22 14:10 araffin