stable-baselines3
Training exceeds total_timesteps
❓ Question
Consider this setup:
```python
import gym

from stable_baselines3 import DQN, A2C, PPO
# from sb3_contrib import ARS, TRPO

env = gym.make('MountainCar-v0')
seed = 42
verbose = 1
timesteps = 10_000

DQN("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps)  # 9_600
A2C("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps)  # 10_000
PPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps)  # 10_240
# ARS("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps)   # 12_800
# TRPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps)  # 10_240
```
The problem is that some of the agents train for (much) more timesteps than specified. The behavior also depends on the number of timesteps requested: for example, DQN trains for exactly 100_000 timesteps if that value is specified. DQN often seems to train for fewer steps than specified, and PPO for more.
This behavior is also problematic when using the `EvaluationCallback`: for some algorithms we run more (time-consuming) evaluations than requested, and for DQN we miss the last evaluation(s).
This question was asked before, here, but no real solution was provided. Setting `reset_num_timesteps=False` also does not change anything (and I am not sure what it is supposed to change). I tested different gym environments as well, but the problem persists.
What is the reason for this behavior? Can it be changed?
Checklist
- [X] I have checked that there is no similar issue in the repo
- [X] I have read the documentation
- [X] If code there is, it is minimal and working
- [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
Hello,
Related to https://github.com/DLR-RM/stable-baselines3/issues/1059, probably duplicate of https://github.com/DLR-RM/stable-baselines3/issues/457
It is because of how the algorithms work. In short:
- PPO/A2C and derivatives collect `n_steps * n_envs` steps of experience before performing an update, so if you want exactly `total_timesteps` you will need to adjust those values
- SAC/DQN/TD3 and other off-policy algorithms collect `train_freq * n_envs` steps before performing an update (when train freq is in steps), so if you want exactly `total_timesteps` you will need to adjust those values (`train_freq=4` by default for DQN)
- ARS and other population-based algorithms evaluate the policy for `n_episodes` with `n_envs`, so unless the number of steps per episode is fixed, it is not possible to exactly achieve `total_timesteps`
- when using multiple envs, each call to `env.step()` corresponds to `n_envs` timesteps, so it is no longer possible to use the `EvaluationCallback` at an exact timestep
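The rounding for rollout-based algorithms can be sketched with plain arithmetic. This is a simplified model of the behavior described above, not SB3's actual code; it assumes the defaults `n_steps=2048` for PPO and `n_steps=5` for A2C, with a single environment:

```python
import math

def effective_timesteps(total_timesteps: int, n_steps: int, n_envs: int = 1) -> int:
    """Timesteps actually collected when experience comes in full rollouts
    of n_steps * n_envs: learn() keeps collecting complete rollouts until
    total_timesteps is reached, so the total is rounded up to the next
    multiple of the rollout size."""
    rollout_size = n_steps * n_envs
    return math.ceil(total_timesteps / rollout_size) * rollout_size

print(effective_timesteps(10_000, n_steps=2048))  # PPO default -> 10240
print(effective_timesteps(10_000, n_steps=5))     # A2C default -> 10000 (exact)
```

This matches the numbers in the question: PPO overshoots to 10_240 because 10_000 is not a multiple of 2048, while A2C hits 10_000 exactly because it is a multiple of 5.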
> for DQN we miss the last evaluation(s).
This sounds more like a bug; could you provide a minimal example to reproduce that issue?
> Also setting reset_num_timesteps=False does not change anything (and I am not sure what it is supposed to change)
This is for plotting, or for when you don't want to reset the timestep counter when calling `learn()` multiple times.
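A toy sketch of what the flag controls (a simplified model of the bookkeeping, not SB3's internals): with the default `reset_num_timesteps=True`, the step counter restarts at 0 on each `learn()` call; with `False`, it keeps counting, so logs and plots continue from where the previous call ended:

```python
class ToyAlgo:
    """Minimal model of the num_timesteps bookkeeping in learn()."""

    def __init__(self):
        self.num_timesteps = 0

    def learn(self, total_timesteps: int, reset_num_timesteps: bool = True):
        if reset_num_timesteps:
            # Default behavior: the counter (and hence logging/plots)
            # starts over from zero.
            self.num_timesteps = 0
        # Train for total_timesteps more steps from the current counter value.
        target = self.num_timesteps + total_timesteps
        while self.num_timesteps < target:
            self.num_timesteps += 1  # stand-in for one env.step()
        return self

algo = ToyAlgo()
algo.learn(1_000).learn(1_000, reset_num_timesteps=False)
print(algo.num_timesteps)  # 2000: the second call continues the counter
```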