[Question] Pong environment with A2C not learning with example code
❓ Question
I copied the code below from the Examples section of the documentation; it trains A2C on a PongNoFrameskip-v4 environment with 4 stacked frames. The episodic mean reward starts out around -20 but then worsens, after which it fluctuates between -21 and -20.5. I am using the default hyperparameters of the A2C CnnPolicy, as the code shows.
```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

vec_env = make_atari_env("PongNoFrameskip-v4", n_envs=4, seed=0)
vec_env = VecFrameStack(vec_env, n_stack=4)
model = A2C("CnnPolicy", vec_env, verbose=1, device="cuda")
model.learn(total_timesteps=10_000_000)
```
I'm running this code with Python 3.10.4 and torch 2.3.0. What could be going wrong here? Shouldn't this example code just work?
Checklist
- [X] I have checked that there is no similar issue in the repo
- [X] I have read the documentation
- [X] If code there is, it is minimal and working
- [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
If you want tuned hyperparameters for Atari, you should use the RL Zoo. The example in the documentation is there to demonstrate the API; we kept it concise to focus on the wrappers we provide.
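For reference, training through the zoo is a one-liner once `rl_zoo3` is installed: `python -m rl_zoo3.train --algo a2c --env PongNoFrameskip-v4`. If you prefer to stay in plain SB3, below is a minimal sketch of applying Atari-tuned A2C settings directly, based on the values in the zoo's `hyperparams/a2c.yml` (16 parallel envs, `ent_coef=0.01`, `vf_coef=0.25`, and a TF-style RMSprop optimizer); treat the exact numbers as assumptions and check the zoo for the current config.

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike
from stable_baselines3.common.vec_env import VecFrameStack

# 16 parallel envs instead of 4 (value assumed from the zoo's Atari A2C entry)
vec_env = make_atari_env("PongNoFrameskip-v4", n_envs=16, seed=0)
vec_env = VecFrameStack(vec_env, n_stack=4)

model = A2C(
    "CnnPolicy",
    vec_env,
    verbose=1,
    device="cuda",
    ent_coef=0.01,  # entropy bonus to keep exploring early on
    vf_coef=0.25,   # weight of the value-function loss
    # TF1-style RMSprop to match the original A2C implementation
    policy_kwargs=dict(
        optimizer_class=RMSpropTFLike,
        optimizer_kwargs=dict(eps=1e-5),
    ),
)
model.learn(total_timesteps=10_000_000)
```

With settings in this ballpark, Pong typically starts improving well before the full 1e7 steps; the defaults in the doc snippet are not expected to reach a positive score.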