stable-baselines3
[Question] HER applied on GoalEnv with ObservationWrapper
### Question

I am using stable-baselines3's implementation of HER with a custom environment, but I ran into problems in the reward computation step. The Gym environment is based on `GoalEnv`, but the observations consist of nested `Dict` spaces. To overcome this limitation, I flatten the contents of `'observation'`, `'desired_goal'` and `'achieved_goal'` into a single `Box` space, using an `ObservationWrapper`. This works fine when collecting data, but when the new goals (based on `n_sampled_goal`) are sampled, the `compute_reward()` function does not receive the `'desired_goal'` and `'achieved_goal'` in the correct Dict format, and gets a batch of flattened arrays instead.

Is there a way to overcome this problem?
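For reference, the wrapper is roughly structured like this (a simplified sketch, not my exact code; it assumes gym's `spaces.flatten` / `spaces.flatten_space` utilities and that each of the three keys holds a nested `Dict`):

```python
import gym
from gym import spaces


class FlattenGoalObsWrapper(gym.ObservationWrapper):
    """Flatten the nested Dict under each goal-env key into a Box (simplified sketch)."""

    KEYS = ("observation", "desired_goal", "achieved_goal")

    def __init__(self, env):
        super().__init__(env)
        # Each nested Dict sub-space becomes a flat Box of the corresponding size
        self.observation_space = spaces.Dict(
            {key: spaces.flatten_space(env.observation_space.spaces[key]) for key in self.KEYS}
        )

    def observation(self, obs):
        # Replace every nested Dict sample with its flattened 1D array
        return {
            key: spaces.flatten(self.env.observation_space.spaces[key], obs[key])
            for key in self.KEYS
        }
```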
### Additional context

The custom Gym environment wraps `compute_reward()` around an abstract reward-function class, so that reward functions can easily be exchanged (rough sketch below).
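Something like the following, with illustrative names only:

```python
from abc import ABC, abstractmethod

import gym


class RewardFunction(ABC):
    """Exchangeable reward operating on the original nested Dict goals (illustrative)."""

    @abstractmethod
    def __call__(self, achieved_goal: dict, desired_goal: dict, info: dict) -> float:
        ...


class MyGoalEnv(gym.GoalEnv):
    # ... spaces, reset() and step() omitted for brevity

    def __init__(self, reward_fn: RewardFunction):
        super().__init__()
        self._reward_fn = reward_fn

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Delegate to the pluggable reward function
        return self._reward_fn(achieved_goal, desired_goal, info)
```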
### Checklist
- [x] I have read the documentation (required)
- [x] I have checked that there is no similar issue in the repo (required)
Hello, could you elaborate a little bit (and follow the custom env issue template by providing a minimal code example to reproduce the issue)? Did you check your env using the env checker?
> in the correct Dict format, and gets a batch of flattened arrays instead.

The signature of `compute_reward()` is `env.compute_reward(obs["achieved_goal"], obs["desired_goal"], info)`, and we require `compute_reward()` to be vectorized (taking a batch of goals as input).
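For instance, a typical vectorized implementation for `Box` goals looks something like this (a sketch only; the sparse reward and the 0.05 threshold are just examples):

```python
import numpy as np


def compute_reward(self, achieved_goal, desired_goal, info):
    # Works both for a single goal of shape (goal_dim,) and for a batch
    # of shape (batch_size, goal_dim), thanks to axis=-1
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    # Sparse reward: 0 when the goal is reached, -1 otherwise
    return -(distance > 0.05).astype(np.float32)
```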
> compute_reward() function does not receive the 'desired_goal' and 'achieved_goal' in the correct Dict format, and gets a batch of flattened arrays instead.

It seems pretty obvious since you

> flatten the contents of 'observation', 'desired_goal' and 'achieved_goal' into a single Box space

For HER to work, the observation space must be

```python
spaces.Dict(
    {
        "observation": ...,
        "desired_goal": ...,
        "achieved_goal": ...,
    }
)
```
Hello,

Thank you for the speedy reply. I guess you just answered my question. If stable-baselines3 requires `compute_reward()` to be vectorized (taking a batch of goals as input), then this is not the case for my environment, and I'll have to work around this.

Currently the reward function processes the contents of the original nested Dict structures of `'achieved_goal'` and `'desired_goal'`, not the flattened versions being passed to the learning code. This is the expected behavior of an `ObservationWrapper`:

Gym documentation: "If you would like to apply a function to the observation that is returned by the base environment before passing it to learning code, you can simply inherit from ObservationWrapper and overwrite the method observation to implement that transformation."
In my case the observation space of the wrapped environment is the following:

```python
spaces.Dict(
    {
        "observation": spaces.Box(...),
        "desired_goal": spaces.Box(...),
        "achieved_goal": spaces.Box(...),
    }
)
```
This allows the reward computation to work on the original nested Dict observation and goal spaces during normal stepping of the environment, when collecting experience. What was not obvious to me was that your HER implementation requires the `compute_reward()` function to work both for a single sample of `'achieved_goal'` and `'desired_goal'`, as well as for a batch of them. Even more problematic is that this batch contains the flattened goals instead of the original Dict goals.
🤖 Custom Gym Environment
I hadn't before, but I tried now and `env_checker` does not return any errors when I provide it with the wrapped environment.
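Concretely, I ran something along these lines (`wrapped_env` being my wrapped environment):

```python
from stable_baselines3.common.env_checker import check_env

# No errors or warnings are reported for the wrapped environment
check_env(wrapped_env, warn=True)
```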
### Describe the bug
(see above)
### Code example
Unfortunately the environment requires a proprietary back-end simulator, so to provide you with a working example I would have to create a dummy version that results in the same problem.
### System Info
The output of `sb3.get_system_info()` is the following:

```
OS: Linux-5.4.0-90-generic-x86_64-with-glibc2.29 #101~18.04.1-Ubuntu
Python: 3.8.10
Stable-Baselines3: 1.6.0
PyTorch: 1.12.0+cu102
GPU Enabled: False
Numpy: 1.23.1
Gym: 0.21.0
```
Note that I am actually running in a Docker container with Ubuntu 20.04.
### Checklist
- [x] I have read the documentation (required)
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have checked my env using the env checker (required)
- [ ] I have provided a minimal working example to reproduce the bug (required)
> then this is not the case for my environment, and I'll have to work around this.

The easiest is to do a for loop, as suggested in https://github.com/DLR-RM/stable-baselines3/issues/854 (even though it will be slow). We require vectorization to have a fast implementation.
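Something along these lines (a sketch; `compute_reward_single()` stands for your existing non-vectorized reward):

```python
import numpy as np


def compute_reward(self, achieved_goal, desired_goal, info):
    # Single goal: call the original (non-vectorized) reward directly
    if achieved_goal.ndim == 1:
        return self.compute_reward_single(achieved_goal, desired_goal, info)
    # Batch of goals: loop over the samples (slow but simple)
    return np.array(
        [
            self.compute_reward_single(a, d, info)
            for a, d in zip(achieved_goal, desired_goal)
        ],
        dtype=np.float32,
    )
```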
> What was not obvious to me was that your HER implementation requires the compute_reward() function to work both for a single sample of 'achieved_goal' and 'desired_goal', as well as for a batch of them.

Yes, I thought it was documented but it is not... we welcome a PR that updates the doc ;) (and the env checker).

> Even more problematic is that this batch contains the flattened goals instead of the original Dict goals.

Hmm, but this is expected, no?
> I would have to create a dummy version that results in the same problem.

Yes, that's what we usually mean by "minimal code example".
> We require vectorization to have a fast implementation.

I will see if I can change the reward function so that it can be vectorized.

> Hmm, but this is expected, no?

Well, I guess so, given that when `compute_reward()` receives a batch of goals, these come from stable-baselines3, which only receives the flattened arrays, not the original nested Dict. I will have to reverse the flattening operation for each sample in the batch to compute the vector of rewards.
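Roughly like this (a sketch; `self.nested_goal_space` and `self._reward_fn` are placeholders for my original nested goal space and Dict-based reward, and `spaces.unflatten` inverts gym's `spaces.flatten`):

```python
import numpy as np
from gym import spaces


def compute_reward(self, achieved_goal, desired_goal, info):
    # HER passes flattened goals, possibly batched, so make them 2D first
    achieved_goal = np.atleast_2d(achieved_goal)
    desired_goal = np.atleast_2d(desired_goal)
    rewards = []
    for a_flat, d_flat in zip(achieved_goal, desired_goal):
        # Reverse the wrapper's flattening to recover the nested Dict goals
        a_dict = spaces.unflatten(self.nested_goal_space, a_flat)
        d_dict = spaces.unflatten(self.nested_goal_space, d_flat)
        rewards.append(self._reward_fn(a_dict, d_dict, info))
    return np.asarray(rewards, dtype=np.float32)
```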
> Yes, that's what we usually mean by "minimal code example".

Maybe once I get my implementation to work, I can make such a code example and even a PR.
It is actually documented in #780 (and the env checker is updated there).
We should probably cherry-pick those changes.