Add Gym 0.25 support
See the comment at https://github.com/DLR-RM/stable-baselines3/pull/780#issuecomment-1164774401 for how to use this PR
Description
Gym 0.24 has been released, and with it come breaking changes. The objective of this PR is to fix all the failing tests.
Missing:
- [x] update changelog
- [x] update notebooks and check it works with gym 0.24 (https://github.com/Stable-Baselines-Team/rl-colab-notebooks/pull/9)
- [ ] update RL Zoo and check it works with gym 0.24 (https://github.com/DLR-RM/rl-baselines3-zoo/pull/256, not ready, gym needs to be fixed first)
- [x] update tutorials (https://github.com/araffin/rl-tutorial-jnrr19/pull/13)
- [x] prepare docker image
- [x] push docker images
- [x] update SB3 contrib (https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/71)
Motivation and Context
Types of changes
- [X] Bug fix (non-breaking change which fixes an issue)
closes #840, closes #871, closes #271
Checklist:
- [X] I've read the CONTRIBUTION guide (required)
- [ ] I have updated the changelog accordingly (required).
- [ ] My change requires a change to the documentation.
- [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
- [ ] I have updated the documentation accordingly.
- [X] I have reformatted the code using `make format` (required)
- [X] I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- [ ] I have ensured `make pytest` and `make type` both pass. (required)
- [ ] I have checked that the documentation builds using `make doc` (required)
Status
- There are still 28 failing tests when running locally, need to investigate further: it seems that the failures come from `pygame` not being installed somehow, probably an issue from gym...
> it seems that the failures come from `pygame` not being installed somehow, probably an issue from gym...

Only a subset of the failures; those are fixed now.
@RedTachyon I'm currently investigating a bug which may be related to https://github.com/openai/gym/pull/2422.
Here's a stack trace of the type of errors:
```
tests/test_save_load.py:216:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
stable_baselines3/dqn/dqn.py:258: in learn
    return super(DQN, self).learn(
stable_baselines3/common/off_policy_algorithm.py:354: in learn
    rollout = self.collect_rollouts(
stable_baselines3/common/off_policy_algorithm.py:584: in collect_rollouts
    actions, buffer_actions = self._sample_action(learning_starts, action_noise, env.num_envs)
stable_baselines3/common/off_policy_algorithm.py:415: in _sample_action
    unscaled_action, _ = self.predict(self._last_obs, deterministic=False)
stable_baselines3/dqn/dqn.py:238: in predict
    action = np.array([self.action_space.sample() for _ in range(n_batch)])
stable_baselines3/dqn/dqn.py:238: in <listcomp>
    action = np.array([self.action_space.sample() for _ in range(n_batch)])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Discrete(10)

    def sample(self) -> int:
>       return self.start + self.np_random.randint(self.n)
E       AttributeError: 'numpy.random._generator.Generator' object has no attribute 'randint'
```
I've been digging into the code and the problem seems to be that this line

```python
def sample(self) -> int:
    return self.start + self.np_random.randint(self.n)
```

is being called with `self.np_random` of type `numpy.random._generator.Generator` instead of `gym.utils.seeding.RandomNumberGenerator`. The former doesn't have a `randint` method, so the code fails.
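For context, the mismatch can be seen with plain numpy, no gym involved: the new-style `Generator` only exposes `integers()`, not the legacy `randint()` that `Discrete.sample()` calls.

```python
import numpy as np

rng = np.random.default_rng(0)  # numpy.random._generator.Generator
print(rng.integers(10))         # works: Generator provides integers()
print(rng.randint(10))          # AttributeError: 'Generator' object has no attribute 'randint'
```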
If it's any help, this is only happening after loading a model, so the problem may be related to that (I haven't checked much on that front yet, wanted to rule out something obvious with the changes on seeding first). Here's an example test where it fails: https://github.com/DLR-RM/stable-baselines3/blob/58a98060f9ec618a740f121c5904ee845fd02f15/tests/test_save_load.py#L216
It seems `discrete.py` in Gym assumes `np_random` will always be of some type but doesn't enforce it explicitly? Or there's something funny happening with the loading...
I think I found the issue - the custom RNG class inherits from the numpy `Generator` for compatibility purposes, but when it gets pickled/unpickled, it defaults to the numpy behavior, which creates a new `Generator`.
I think I see two main ways to solve it - either in `__setstate__` of `Space` make sure that an RNG is set instead of a `Generator`; or in the RNG code do some custom code to pickle/unpickle the proper class.
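As a rough illustration of the second option (hypothetical stand-in code, not gym's actual implementation): a `Generator` subclass can keep its type across a pickle round-trip by defining its own `__reduce__`.

```python
import pickle
import numpy as np

# Hypothetical stand-in for gym's compatibility RNG: a Generator subclass that
# keeps the legacy randint() method around.
class CompatRNG(np.random.Generator):
    def randint(self, n):
        return int(self.integers(n))

    # Without this, pickling falls back to Generator's own reduction and the
    # unpickled object is a plain Generator, losing randint().
    def __reduce__(self):
        return (CompatRNG, (self.bit_generator,))

rng = CompatRNG(np.random.PCG64(0))
restored = pickle.loads(pickle.dumps(rng))
assert isinstance(restored, CompatRNG)  # the subclass survives the round-trip
print(restored.randint(10))
```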
> I think I found the issue - the custom RNG class inherits from the numpy Generator for compatibility purposes, but when it gets pickled/unpickled, it defaults to the numpy behavior which creates a new Generator.
>
> I think I see two main ways to solve it - either in `__setstate__` of `Space` make sure that an RNG is set instead of a Generator; or in the RNG code do some custom code to pickle/unpickle the proper class
I see, thanks for checking. I wonder what this means for SB3 adopting Gym 0.22, since the problem seems to be on Gym's side. I'm guessing the fix (whenever it comes) won't be available for SB3 until the next release by Gym?
> won't be available for SB3 until the next release by Gym?
https://github.com/openai/gym/issues/2640 must be fixed first too
We're going to do a 0.22.1 release in the near future that will fix all of these things; some of the relevant fixes here have already been merged.
Update on this:
- Confirmed that https://github.com/openai/gym/pull/2639 solves the bug I mentioned before, so in the next Gym release those tests will pass.
- There are three remaining failing tests, from two separate bugs:
- Mean reward is below desired threshold.
```
FAILED tests/test_identity.py::test_discrete[env2-A2C] - AssertionError: Mean reward below threshold: 88...
FAILED tests/test_identity.py::test_discrete[env2-PPO] - AssertionError: Mean reward below threshold: 86...
```
Not sure exactly how gym's updates are causing this lower mean reward, need to check this further
- Warning not raised in HER test
```
FAILED tests/test_her.py::test_save_load_replay_buffer[True-False] - assert 0 == 1
```
Caused because the trajectory truncation is not raising a warning in these lines:
https://github.com/DLR-RM/stable-baselines3/blob/cdaa9ab418aec18f41c7e8e12e0ad28f238553eb/tests/test_her.py#L258-L261
EDIT: CI also shows about 3 tests failing with EOFError, and I also observed those locally, but they also seem to be solved when using gym master (tested locally).
Regarding the "mean reward below threshold" problem, after some further investigations the root cause is the change in seeding behaviour in gym.
SB3 sets seeds here: https://github.com/DLR-RM/stable-baselines3/blob/e88eb1c9ca98650f802409e5845e952c39be9e76/stable_baselines3/common/base_class.py#L576
Before gym 0.22.0, the default `seed()` method was not actually setting any seed [source]. In gym 0.22 the default `seed()` does set the seed [source].
The consequence of this is, for instance, that the result of https://github.com/DLR-RM/stable-baselines3/blob/e88eb1c9ca98650f802409e5845e952c39be9e76/stable_baselines3/common/evaluation.py#L82 is not consistent between gym==0.21.0 and gym>=0.22.0. The solution that I think makes the most sense is to simply adjust the threshold value in the test.
> Before gym 0.22.0, the default `seed()` method was not actually setting any seed [source]. In gym 0.22 the default `seed()` does set the seed [source].

I see...

> is not consistent between gym==0.21.0 and gym>=0.22.0.

is there any way to make it consistent?

> The solution that I think makes the most sense is to simply adjust the threshold value in the test.
The whole point of those performance tests is to detect any change that may hurt performance or change results (they fail easily as soon as any minor change is made). So we should not change the threshold but rather fix the underlying change/issue; in the current case, I would prefer changing the seed if we cannot have behavior consistent with the previous version.
btw, are all those warnings (see below) due to gym only?
```
gym/utils/seeding.py:47: 1 warning
tests/test_callbacks.py: 4153 warnings
tests/test_cnn.py: 377 warnings
tests/test_custom_policy.py: 106 warnings
tests/test_deterministic.py: 238 warnings
tests/test_dict_env.py: 14116 warnings
tests/test_envs.py: 2 warnings
tests/test_her.py: 1566 warnings
tests/test_identity.py: 18153 warnings
tests/test_logger.py: 300 warnings
tests/test_monitor.py: 3000 warnings
tests/test_predict.py: 109 warnings
tests/test_run.py: 222 warnings
tests/test_save_load.py: 6597 warnings
tests/test_spaces.py: 1110 warnings
tests/test_train_eval_mode.py: 31 warnings
tests/test_utils.py: 12 warnings
tests/test_vec_envs.py: 216 warnings
tests/test_vec_monitor.py: 4000 warnings
tests/test_vec_normalize.py: 6547 warnings
  /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/gym/utils/seeding.py:47: DeprecationWarning: WARN: Function `rng.randint(low, [high, size, dtype])` is marked as deprecated and will be removed in the future. Please use `rng.integers(low, [high, size, dtype])` instead.
```
same for these warnings?
```
tests/test_envs.py::test_non_default_action_spaces[new_action_space4]
  /home/runner/work/stable-baselines3/stable-baselines3/stable_baselines3/common/env_checker.py:272: UserWarning: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) cf https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
    warnings.warn(

tests/test_utils.py: 16 warnings
  /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/gym/utils/seeding.py:138: DeprecationWarning: WARN: Function `hash_seed(seed, max_bytes)` is marked as deprecated and will be removed in the future.
    deprecation(

tests/test_utils.py: 16 warnings
  /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/gym/utils/seeding.py:175: DeprecationWarning: WARN: Function `_bigint_from_bytes(bytes)` is marked as deprecated and will be removed in the future.
    deprecation(
```
Similarly, the HER test failure was also caused by the change in RNG. The failing test required that after training for 200 steps, the env had to be left in the middle of a trajectory. After the change in gym, the RNG gods decided that after 200 steps the env had just finished an episode, i.e., `self.current_idx == 0` in
https://github.com/DLR-RM/stable-baselines3/blob/e88eb1c9ca98650f802409e5845e952c39be9e76/stable_baselines3/her/her_replay_buffer.py#L523
==> warning is never raised ==> test assertion fails.
A simple change in the seed fixes the test.
@araffin I would have to check the warnings, but now a warning will be raised every time `env.seed` is called.
I reverted the threshold back to 90 and changing the seed did the trick.
> but now a warning will be raised every time `env.seed` is called.
Could we make the `VecEnv` `seed()` method backward compatible? In the sense that when `venv.seed(seed)` is called, it calls `reset(seed=seed)` if `seed` is accepted in `reset()`, and otherwise calls `seed(seed)`.
This would prevent most warnings normally.
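A rough sketch of that idea (illustrative only, not a final implementation; it assumes direct access to the wrapped envs as in `DummyVecEnv`):

```python
import inspect
from typing import List, Optional, Union

def seed(self, seed: Optional[int] = None) -> List[Union[None, int]]:
    if seed is None:
        return [None for _ in self.envs]
    seeds = []
    for idx, env in enumerate(self.envs):
        if "seed" in inspect.signature(env.unwrapped.reset).parameters:
            # gym >= 0.22 style: seeding goes through reset()
            seeds.append(env.reset(seed=seed + idx))
        else:
            # gym <= 0.21 style: dedicated seed() method
            seeds.append(env.seed(seed + idx))
    return seeds
```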
@carlosluis
Regarding the "mean reward below threshold" problem, after some further investigations the root cause is the change in seeding behaviour in gym.
Can you elaborate on how you got to this conclusion? Something seems very weird here, since this would imply that the test used to be stochastic (and happened to always pass) and now is deterministic, but the seed happens to fail. What's more, the change you mention affects `env.np_random`, which from what I can tell isn't used in the actual test.
Further investigations on these:
@RedTachyon is right, the determinism in the test (both using gym v0.21.0 and current master) actually comes from setting the seed on the action space rather than the environment, as done in this line: https://github.com/DLR-RM/stable-baselines3/blob/e88eb1c9ca98650f802409e5845e952c39be9e76/stable_baselines3/common/base_class.py#L574
So after further inspection I have reason to believe the difference in test results between using gym 0.21.0 and gym>=0.22.0 comes from a change in behaviour of the np_random method. Compare v0.21.0 and the current master.
The evidence I have is that immediately after setting the action space seed in the line above, we get a different output on the first call to `action_space.sample()` here:
https://github.com/DLR-RM/stable-baselines3/blob/e88eb1c9ca98650f802409e5845e952c39be9e76/stable_baselines3/common/envs/identity_env.py#L49
For example, with seed=2 I get:
- v0.21.0 results in `array([0, 0, 1, 0], dtype=int8)`
- master results in `array([1, 0, 0, 1], dtype=int8)`
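For reference, a minimal way to reproduce this comparison (assuming the action space here is `MultiBinary(4)`, as the `int8` samples suggest):

```python
import gym

space = gym.spaces.MultiBinary(4)
space.seed(2)
# The first sample after seeding differs between gym==0.21.0 and gym>=0.22.0
# because the underlying RNG mechanism changed between releases.
print(space.sample())
```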
The difference is probably that the underlying RNG mechanism changed somewhere between releases, so the effect of setting an old random seed won't be the same.
I'd be super uncomfy having a test that depends so heavily on the random seed, but if that's just to check for consistency, searching for a random seed that works again would probably be good enough.
@carlosluis i think there was a misunderstanding with my remark here: https://github.com/DLR-RM/stable-baselines3/pull/780#issuecomment-1062235112
please do not change the `VecEnv` signature but make the `VecEnv` `seed()` method forward compatible (so it supports all gym versions).
> @carlosluis i think there was a misunderstanding with my remark here: #780 (comment)
>
> please do not change the `VecEnv` signature but make the `VecEnv` `seed()` method forward compatible (so it supports all gym versions).
@araffin I see, yes I understood something different. AFAIK, to make `VecEnv` `seed()` forward compatible we should internally call `reset(seed=seed)` for every env, similar to how it's now done in `DummyVecEnv`:
```python
def seed(self, seed: Optional[int] = None) -> List[Union[None, int]]:
    seeds = list()
    for idx, env in enumerate(self.envs):
        seeds.append(env.reset(seed=seed + idx))
    return seeds
```
but I'm not sure how to do that from the base class `VecEnv`, it doesn't have access to the individual environments, right?
> AFAIK, to make `VecEnv` `seed()` forward compatible we should internally call `reset(seed=seed)` for every env, similar to how it's now done in `DummyVecEnv`.

yes

> but I'm not sure how to do that from the base class `VecEnv`

you don't do it in the base class, but rather in `DummyVecEnv` and in `SubprocVecEnv` (in the worker) and that should do the trick ;)
> > AFAIK, to make `VecEnv` `seed()` forward compatible we should internally call `reset(seed=seed)` for every env, similar to how it's now done in `DummyVecEnv`.
>
> yes
>
> > but I'm not sure how to do that from the base class `VecEnv`
>
> you don't do it in the base class, but rather in `DummyVecEnv` and in `SubprocVecEnv` (in the worker) and that should do the trick ;)
Got it! I already have that mostly; I will remove the signature change in the base class and also make the `SubprocVecEnv` worker change.
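For reference, a hypothetical sketch of what the worker-side change could look like (a simplified loop, not the actual SB3 `_worker` implementation):

```python
def _worker(remote, env_fn):
    env = env_fn()
    while True:
        cmd, data = remote.recv()
        if cmd == "seed":
            # gym >= 0.22: seed via reset(); older gym would instead call env.seed(data)
            remote.send(env.reset(seed=data))
        elif cmd == "step":
            remote.send(env.step(data))
        elif cmd == "close":
            remote.close()
            break
```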
@araffin As I was going through the commit diff to document the changes, I realized that maybe removing the GoalEnv class is incompatible with HER? If that's the case, then we would have to do as you said before and bring the class from https://github.com/Farama-Foundation/Gym-Robotics
In the docs for HER we have:
```rst
.. warning::

  HER requires the environment to inherit from `gym.GoalEnv <https://github.com/openai/gym/blob/3394e245727c1ae6851b504a50ba77c73cd4c65b/gym/core.py#L160>`_
```
I don't find where specifically such a warning is raised, but there's definitely a codepath that expects to find the `compute_reward()` method enforced by the `GoalEnv` interface.
https://github.com/DLR-RM/stable-baselines3/blob/30772aa9f53a4cf61571ee90046cdc454c1b11d7/stable_baselines3/her/her_replay_buffer.py#L346-L362
This also impacts the HER colab example in https://github.com/DLR-RM/stable-baselines3/blob/master/docs/guide/examples.rst
Lastly, the highway-env repo went through a similar process to start using `GoalEnv` from gym-robotics, but ultimately they dropped the dependency and copied the class instead, see https://github.com/eleurent/highway-env/commit/f8a4c633012ee004786346908e3f9b841edd64aa. Maybe we need to keep that in mind when discussing what to do about this issue.
> I don't find where specifically such a warning is raised, but there's definitely a codepath that expects to find the `compute_reward()` method enforced by the `GoalEnv` interface.

well, the two features that we use are:
- we expect a dict obs with three keys `observation`, `achieved_goal` and `desired_goal`
- we expect a vectorized implementation of `compute_reward()`
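To make those two requirements concrete, here is a minimal illustration (shapes and the threshold are made up; only the observation-space layout and the batched `compute_reward()` signature matter here):

```python
import numpy as np
import gym
from gym import spaces

class MinimalGoalEnv(gym.Env):
    """Sketches only the interface HER relies on; reset()/step() are omitted."""

    def __init__(self):
        goal_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.observation_space = spaces.Dict(
            {
                "observation": spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32),
                "achieved_goal": goal_space,
                "desired_goal": goal_space,
            }
        )
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Vectorized: accepts (batch_size, goal_dim) arrays and returns (batch_size,) rewards.
        distances = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return -(distances > 0.05).astype(np.float32)
```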
> > I don't find where specifically such a warning is raised, but there's definitely a codepath that expects to find the `compute_reward()` method enforced by the `GoalEnv` interface.
>
> well, the two features that we use are:
>
> * we expect a dict obs with three keys `observation`, `achieved_goal` and `desired_goal`
> * we expect a vectorized implementation of `compute_reward()`
I wanted to check if there's something to be done in SB3 regarding the removal of GoalEnv from gym, in which case we should probably address it in this PR.
> I wanted to check if there's something to be done in SB3 regarding the removal of GoalEnv from gym, in which case we should probably address it in this PR.
Those checks are now missing: https://github.com/openai/gym/blob/v0.21.0/gym/core.py#L182 and need to be added to the env checker at least (maybe somewhere else if it's easy to do).
And the doc must be updated to explain what is expected for HER replay buffer (because we cannot refer to GoalEnv interface anymore).
EDIT: before I forget, I will probably add backward compat for render modes in the vec video recorder.
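For illustration, such a check in the env checker could look roughly like this (the function name and messages are made up):

```python
from gym import spaces

def _check_goal_env(env) -> None:
    # Stand-in for the checks GoalEnv.reset() used to perform in gym <= 0.21.
    assert isinstance(env.observation_space, spaces.Dict), (
        "Goal-conditioned envs (e.g. for HER) must use a Dict observation space"
    )
    for key in ("observation", "achieved_goal", "desired_goal"):
        assert key in env.observation_space.spaces, f"Missing key in observation space: {key}"
    assert hasattr(env, "compute_reward"), (
        "Goal-conditioned envs must implement a (vectorized) compute_reward() method"
    )
```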
@araffin the ALE-related warnings about `seeding.hash_seed` and `rng.randint` are now fixed in ale-py==0.7.5.
@araffin I would like to help getting this PR merged. Are you or anyone else working on the missing points? If not, could I work on PRs to the connected repos (zoo, notebooks etc.)?
> @araffin I would like to help getting this PR merged. Are you or anyone else working on the missing points? If not, could I work on PRs to the connected repos (zoo, notebooks etc.)?
thanks for the help =) currently no one is working on the missing points, please go ahead!
@araffin regarding the checklist,
- Tutorials, Notebooks - Created PRs.
- Zoo - Tests are working with sb3 at this fork, and sb3-contrib at current master (however, sb3-contrib tests fail due to using `gym.GoalEnv`, but that does not affect the zoo; maybe I can separately work on fixing that). Only need to change the gym version in requirements and the changelog, which can be done along with the next SB3 release.
I think that's all I can do outside this PR. Please let me know if this is okay.
> 2. Zoo - Tests are working with sb3 at this fork,
what test did you run for the zoo?
by running `make pytest` with this branch, I already get at least one test failing (same for `make check-trained-agents`).
EDIT: the issue seems to come from dependencies that need to be updated
> what test did you run for the zoo?
I had run `make pytest` and everything had passed for me. I pulled recent commits and tried again now; everything was still passing.
I had missed `make check-trained-agents`; I ran it now. It had errors due to PandaEnvs (because of the removal of `gym.GoalEnv`, and panda-gym is not updated to reflect this). I'll create a PR there.