
Training exceeds total_timesteps

timosturm opened this issue 3 years ago · 1 comment

❓ Question

Consider this setup:

import stable_baselines3
import gym
from stable_baselines3 import DQN, A2C, PPO
#from sb3_contrib import ARS, TRPO

env = gym.make('MountainCar-v0')

seed = 42
verbose = 1
timesteps = 10_000

DQN("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 9600
A2C("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_000
PPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240
#ARS("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 12_800
#TRPO("MlpPolicy", env, verbose=verbose, seed=seed).learn(total_timesteps=timesteps) # 10_240

The problem is that some of the agents train for (many) more timesteps than specified. The discrepancy depends on the number of timesteps set; e.g., DQN trains for exactly 100_000 timesteps if that value is specified. DQN often seems to train for fewer steps than specified, and PPO for more.
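The PPO overshoot above is consistent with training proceeding in whole rollouts: with PPO's default `n_steps=2048` and a single environment, the step count is rounded up to the next full rollout. A minimal arithmetic sketch (assuming the rollout size is `n_steps * n_envs`, which matches the maintainer's explanation below):

```python
import math

total_timesteps = 10_000
n_steps, n_envs = 2048, 1  # PPO defaults with a single env

# PPO only stops after completing a full rollout of n_steps * n_envs steps,
# so the actual number of collected steps is rounded up to a multiple of it.
rollout_size = n_steps * n_envs
actual_steps = math.ceil(total_timesteps / rollout_size) * rollout_size
print(actual_steps)  # 10240, matching the PPO run above
```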

This behavior is also a problem when using the EvaluationCallback, because for some algorithms we run more (time-consuming) evaluations than requested, and for DQN we miss the last evaluation(s).

The question was asked before here, but no real solution was provided. Also, setting reset_num_timesteps=False does not change anything (and I am not sure what it is supposed to change). I also tested this with different gym environments, but the problem persists.

What is the reason for this behavior? Can it be changed?

Checklist

  • [X] I have checked that there is no similar issue in the repo
  • [X] I have read the documentation
  • [X] If code there is, it is minimal and working
  • [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.

timosturm avatar Nov 02 '22 17:11 timosturm

Hello,

Related to https://github.com/DLR-RM/stable-baselines3/issues/1059, probably duplicate of https://github.com/DLR-RM/stable-baselines3/issues/457

It is because of how the algorithms work. In short:

  • PPO/A2C and derivatives collect n_steps * n_envs steps of experience before performing an update, so if you want to train for exactly total_timesteps you will need to adjust those values
  • SAC/DQN/TD3 and other off-policy algorithms collect train_freq * n_envs steps before performing an update (when train_freq is in steps), so if you want to train for exactly total_timesteps you will need to adjust those values (train_freq=4 by default for DQN)
  • ARS and other population-based algorithms evaluate the policy for n_episodes with n_envs, so unless the number of steps per episode is fixed, it is not possible to hit total_timesteps exactly
  • when using multiple envs, each call to env.step() corresponds to n_envs timesteps, so it is no longer possible to trigger the EvaluationCallback at an exact timestep
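Following the points above, one way to land exactly on the budget is to pick collection sizes that divide total_timesteps evenly. A sketch of that divisibility check (the `n_steps=2000` value is an illustrative choice, not a recommended hyperparameter):

```python
total_timesteps = 10_000
n_envs = 1

# Choose n_steps so that a whole number of rollouts lands exactly
# on the timestep budget: total_timesteps must be a multiple of
# n_steps * n_envs.
n_steps = 2000
assert total_timesteps % (n_steps * n_envs) == 0

# With stable-baselines3 installed, this would then stop at exactly 10_000:
# PPO("MlpPolicy", env, n_steps=n_steps).learn(total_timesteps=total_timesteps)
```

The same reasoning applies to train_freq for off-policy algorithms when train_freq is expressed in steps.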

for DQN we miss the last evaluation(s).

this sounds more like a bug, could you provide a minimal example to reproduce that issue?

Also setting reset_num_timesteps=False does not change anything (and I am not sure what it is supposed to change)

this is for plotting, or when you don't want to perform a reset when calling learn() multiple times
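To illustrate the intent of reset_num_timesteps, here is a toy stand-in (not the real SB3 implementation) that only models the counter semantics: with the default reset the counter restarts at zero on every learn() call, while reset_num_timesteps=False continues counting on top of previous training:

```python
class TinyModel:
    """Toy stand-in for an SB3 model that only tracks num_timesteps."""

    def __init__(self):
        self.num_timesteps = 0

    def learn(self, total_timesteps, reset_num_timesteps=True):
        if reset_num_timesteps:
            # fresh run: the counter (and hence logging/plots) restarts at zero
            self.num_timesteps = 0
        else:
            # continued run: the new budget is added on top of past training
            total_timesteps += self.num_timesteps
        while self.num_timesteps < total_timesteps:
            self.num_timesteps += 1  # pretend to collect one environment step
        return self


model = TinyModel()
model.learn(10_000)
model.learn(10_000, reset_num_timesteps=False)
print(model.num_timesteps)  # 20000: the second call continued the count
```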

araffin avatar Nov 03 '22 13:11 araffin