Updates environments to use `gymnasium.vector.VectorEnv`
Description
Currently, the environments use `gymnasium.Env` even though the implementations are all vectorised. This PR updates the implementations to use `gymnasium.vector.VectorEnv`.
Fixes https://github.com/isaac-sim/IsaacLab/issues/3095
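For illustration, a minimal sketch of what the `gymnasium.vector.VectorEnv` interface looks like (this toy class is hypothetical and only shows the vector API shape, not the IsaacLab implementation):

```python
import gymnasium as gym
import numpy as np
from gymnasium.vector.utils import batch_space


class ToyVectorEnv(gym.vector.VectorEnv):
    """Minimal batched environment exposing the gymnasium vector API."""

    def __init__(self, num_envs: int = 4):
        self.num_envs = num_envs
        # per-sub-environment spaces; the batched spaces are derived from them
        self.single_observation_space = gym.spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.single_action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.observation_space = batch_space(self.single_observation_space, num_envs)
        self.action_space = batch_space(self.single_action_space, num_envs)

    def reset(self, *, seed=None, options=None):
        # observations are batched with a leading (num_envs, ...) dimension
        return np.zeros((self.num_envs, 3), dtype=np.float32), {}

    def step(self, actions):
        obs = np.zeros((self.num_envs, 3), dtype=np.float32)
        rewards = np.zeros(self.num_envs, dtype=np.float32)
        terminations = np.zeros(self.num_envs, dtype=bool)
        truncations = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, terminations, truncations, {}
```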
Type of change
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- This change requires a documentation update
Checklist
- [ ] I have run the `pre-commit` checks with `./isaaclab.sh --format`
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there
Thanks for the PR. Will we need to switch to `gym.make_vec` instead of `gym.make` with this change?
@kellyguo11 Yes, and `gym.register` should use `vector_entry_point` for `make_vec`.
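For example, a rough sketch assuming Gymnasium ≥ 1.0 (the task id and module path below are placeholders, not IsaacLab's actual registration):

```python
import gymnasium as gym

# Registering a `vector_entry_point` lets `gym.make_vec` construct the
# batched environment directly instead of wrapping `gym.make` copies.
gym.register(
    id="Isaac-Example-v0",                                  # placeholder id
    vector_entry_point="my_package.envs:ExampleVectorEnv",  # placeholder path
)

# `num_envs` (and any extra kwargs) are forwarded to the vector entry point.
envs = gym.make_vec("Isaac-Example-v0", num_envs=1024)
```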
Do you know what autoreset mode IsaacLab uses? https://farama.org/Vector-Autoreset-Mode
@pseudo-rnd-thoughts Thanks for submitting this PR! I think IsaacLab uses the same-step reset mode.
@pseudo-rnd-thoughts Let me know if there is anything I can help with; it would be nice to update to gymnasium's up-to-date standard :))
We don't provide `terminal_obs` in the `info` dictionary, so it is the "Disabled" mode.
Ohh, that's right, sorry for the miss. @Mayankm96 Do you think it makes sense to let users choose which mode to enable on the IsaacLab side? We could default to Disabled mode for RL applications. If users could switch the mode from Disabled to Same-Step or Next-Step, we could also move the Recorder Manager out of `ManagerBasedRLEnv`: one reason it had to live there is that Disabled mode doesn't provide `terminal_obs`, so the only way to record the pre-reset observation is inside the reset logic of `ManagerBasedRLEnv`.
> We don't provide `terminal_obs` in the `info` dictionary, so it is the "Disabled" mode.
@Mayankm96 Are you sure? The `manager_based_rl_env` has the following note:
> Note: For vectorized environments, it is recommended to **only** call the :meth:`reset` method once before the first call to :meth:`step`, i.e. after the environment is created. After that, the :meth:`step` function handles the reset of terminated sub-environments. This is because the simulator does not support resetting individual sub-environments in a vectorized environment.
For the disabled mode, users must call `reset` themselves to reset each sub-environment.
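For reference, as I understand the disabled-mode contract from that article, the user side looks roughly like this (a sketch assuming Gymnasium ≥ 1.1, using a stock gymnasium env; it describes the spec, not IsaacLab's current behaviour):

```python
import gymnasium as gym
import numpy as np

# Build a vector env with autoreset disabled.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(8)],
    autoreset_mode=gym.vector.AutoresetMode.DISABLED,
)
obs, info = envs.reset(seed=0)

for _ in range(1000):
    obs, rewards, terminations, truncations, infos = envs.step(envs.action_space.sample())

    # The environment never resets itself; the user resets only the finished
    # sub-environments via the `reset_mask` option.
    done = np.logical_or(terminations, truncations)
    if done.any():
        obs, info = envs.reset(options={"reset_mask": done})
```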
Would you happen to know where the auto-reset code is implemented?
@ooctipus Could you approve the GitHub Actions? I'm unfamiliar with the codebase, so I suspect I've missed a necessary code change, and I haven't been able to get the tests to run locally.
@pseudo-rnd-thoughts: I didn't realize that `reset(masks)` is a user-defined call. In that case, the closest we have to what is described is the Same-Step reset, although we don't provide the `final_obs`, as mentioned above.
> I didn't realize that `reset(masks)` is a user-defined call. In that case, the closest we have to what is described is the Same-Step reset, although we don't provide the `final_obs`, as mentioned above.
Yes, checking the code in `DirectRLEnv.step`, it looks like you've implemented the same-step autoreset mode as I've described it in that article; however, you don't return the `final_obs`, as you mentioned.
Is there a reason the implementation doesn't include the episode's final observation? Currently, you only get the next episode's first observation. To my understanding, the final observation is necessary to correctly implement most RL algorithms, since in the case of truncation (i.e., time-outs) you otherwise don't have the episode's final observation to bootstrap from. https://farama.org/Gymnasium-Terminated-Truncated-Step-API
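To make the concern concrete, here is a generic sketch of the one-step value target most algorithms compute at episode boundaries (names and shapes are illustrative, not taken from any specific library):

```python
import torch


def td_target(rewards: torch.Tensor,
              next_values: torch.Tensor,
              terminated: torch.Tensor,
              gamma: float = 0.99) -> torch.Tensor:
    """One-step TD target under the terminated/truncated distinction.

    Only true terminations zero out the future return. On truncation
    (time-out) the return is still bootstrapped, which requires the critic
    value of the episode's *final* observation; if the environment only
    returns the next episode's first observation, that value is computed
    from the wrong state.
    """
    return rewards + gamma * next_values * (~terminated).float()
```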
| Test file | Result | Time (s) | Passed |
| --- | --- | --- | --- |
| /workspace/isaaclab/source/isaaclab_rl/test/test_rl_games_wrapper.py | FAILED | 0.08 | 0/1 |
| /workspace/isaaclab/source/isaaclab_rl/test/test_skrl_wrapper.py | FAILED | 0.09 | 0/1 |
| /workspace/isaaclab/source/isaaclab_rl/test/test_sb3_wrapper.py | FAILED | 0.08 | 0/1 |
| /workspace/isaaclab/source/isaaclab_rl/test/test_rsl_rl_wrapper.py | FAILED | 0.16 | 0/2 |
Seems like a few tests are failing. I am letting it run again; if it fails again, I'll take a look at it.
> Is there a reason the implementation doesn't include the episode's final observation? Currently, you only get the next episode's first observation.
I checked the code for RSL-RL (RL-Games and skrl also do something similar): It seems they use the state before the termination ($s_{T-1}$ instead of $s_T$) for computing the bootstrapped values at truncations.
- SKRL: https://github.com/Toni-SM/skrl/blob/main/skrl/agents/torch/ppo/ppo.py#L297-L299
- RSL-RL: https://github.com/leggedrobotics/rsl_rl/blob/main/rsl_rl/algorithms/ppo.py#L161-L164
While I agree that returning the `final_obs` is technically the right thing to do, empirically it does not make a difference. Using the values for the states (computed in the previous algorithm step) saves an additional forward call of the critic network and provides some performance gain when handling a large batch of environments.
Tagging @Toni-SM since he has a better idea for algorithms other than PPO.
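For context, the trick referenced above amounts to roughly the following (a paraphrased sketch, not a verbatim copy of the linked code):

```python
import torch


def bootstrap_time_outs(rewards: torch.Tensor,
                        values: torch.Tensor,
                        time_outs: torch.Tensor,
                        gamma: float = 0.99) -> torch.Tensor:
    """Bootstrap truncated (timed-out) episodes with V(s_{T-1}).

    `values` are the critic values already computed for the current step,
    so no extra critic forward pass on a final observation is needed.
    """
    return rewards + gamma * values * time_outs.float()
```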
Do we need to switch from `gym.make` to `gym.make_vec` calls in the scripts? @ooctipus
> Do we need to switch from `gym.make` to `gym.make_vec` calls in the scripts? @ooctipus
I think this could become a breaking change for user environments; I'm not sure we get much benefit in return?
The largest benefit will probably be for implementing a new RL-library wrapper; the process could be easier by viewing `ManagerBasedEnv` or `DirectEnv` as a `VectorEnv`.
Updating to `VectorEnv` should make it clearer to users that the environments are vectorised and remove the need for libraries to implement their own converter functions.
Further, users will be able to use vector wrappers with the environments, which isn't possible with `gymnasium.Env` implementations.
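For instance, something along these lines should become possible (assuming Gymnasium ≥ 1.0 vector wrappers; the task id is a placeholder):

```python
import gymnasium as gym
from gymnasium.wrappers.vector import NormalizeReward, RecordEpisodeStatistics

envs = gym.make_vec("Isaac-Example-v0", num_envs=64)  # placeholder task id
envs = RecordEpisodeStatistics(envs)  # per-episode return/length in `info`
envs = NormalizeReward(envs)          # running reward normalisation
```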
Right now we get around it by wrapping the IsaacLab envs (not gym envs) with RL-library wrappers, where the RL libraries can directly assume they are interacting with a `ManagerBasedEnv` or `DirectEnv`. We basically wrote our own converter functions, but even after switching to `VectorEnv`, customized converter functions may still be needed, because each RL library has some special requirements that a generic `VectorEnv` is unlikely to satisfy.
I guess the largest breaking change is the switch from `entry_point` to `vector_entry_point`? If we applied this change, an `entry_point` a user has implemented for their environment would potentially not get registered properly until they also change it to `vector_entry_point`.
I am also weighing whether the change from `entry_point` to `vector_entry_point` is needed at all, since in the current IsaacLab implementation, even if `num_envs == 1`, everything is still vectorized as `(n, observation_shape)` and `(n, action_shape)` with `n = 1`.
My general feeling is that if we introduce only a very minimal breaking change, there is still value in adopting gym's best practices for the long term.
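One way to soften that breaking change, sketched under the assumption that `gym.register` accepts both entry points in a single registration (module paths below are placeholders):

```python
import gymnasium as gym

# Keeping both entry points means existing `gym.make(...)` user code keeps
# working, while `gym.make_vec(...)` picks up the vectorised implementation.
gym.register(
    id="Isaac-Example-v0",                                  # placeholder id
    entry_point="my_package.envs:ExampleEnv",               # placeholder path
    vector_entry_point="my_package.envs:ExampleVectorEnv",  # placeholder path
)
```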
Do I understand correctly that currently there is no way to access the true final obs of an episode? Why not just add it to the `info` dict, since you compute it anyway in `step`?
@noahfarr We don't compute it in `step`. It's only computed after the reset is performed for all the environments together. We don't support computing the observations for only a subset of environment indices in the vectorized environment.
Regarding what batched RL libraries do: the agent can use the last observation of the previous step as the final obs instead of having the environment return it twice. At least for continuous control problems, I think the two observations are close enough to use as an approximation, and I would rather avoid computing the observations twice for performance reasons.
We have two ways of designing environments. Users who design their own environments can compute `final_obs` in the `step` function if they need it. But the examples we provide don't do this right now, and very likely we won't, for the reasons above.
A big +1 for moving to gymnasium's vector env and providing a clear `AutoresetMode`. Some frameworks work around this issue (RSL-RL for sure, SKRL at least for PPO), but others (e.g. torchRL) do not. It would be great if IsaacLab had fewer footguns when switching learning frameworks.