
[Feature Request] An option to collect rollouts for n_episodes instead of n_steps

Open CppMaster opened this issue 2 years ago • 2 comments

🚀 Feature

An option to collect rollouts for n_episodes instead of n_steps for on-policy algorithms.

Motivation

Some environments, like games, give the most important reward at the end of the episode (such as a win or a loss). The length of each episode may vary greatly. Right now, if the number of steps is less than the length of an episode, a rollout may not contain the final reward, so that reward is not discounted back to any transition in the rollout and this crucial information is missing. Ending each rollout at the end of the n-th episode would ensure that every rollout has n ending rewards, which would make training more consistent.

Pitch

Implement an option to collect rollouts for n_episodes instead of n_steps for on-policy algorithms, for example: PPO("MlpPolicy", env, n_rollout_episodes=1)

CppMaster avatar Sep 10 '22 16:09 CppMaster

Hello, I understand your motivation, but this is a special case that would require too many internal changes (replacing the fixed-size numpy rollout buffer with a variable-size list, which would also affect speed), and n_rollout_episodes=1 cannot be ensured for n_envs > 1. However, for your own needs, you can always implement it in a fork of SB3 ;)

A temporary solution is to set n_steps to a large value (and probably reduce n_epochs then).
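As a rough sketch of that workaround (the numbers and the CartPole-v1 environment below are illustrative assumptions, not recommended settings), the rollout length can be raised well above the default while the number of optimization epochs per rollout is lowered. The helper name build_model is hypothetical; n_steps, batch_size and n_epochs are existing PPO arguments in SB3.

```python
# Illustrative values only; tune them for your own environment.
N_ENVS = 1  # assumption: a single environment

ppo_kwargs = dict(
    n_steps=8192,    # steps collected per env per rollout (SB3 default: 2048)
    batch_size=256,  # minibatch size (SB3 default: 64)
    n_epochs=4,      # passes over each rollout (SB3 default: 10)
)


def build_model(env_id="CartPole-v1"):
    """Create a PPO model with a long rollout (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without SB3 installed.
    from stable_baselines3 import PPO

    return PPO("MlpPolicy", env_id, **ppo_kwargs)
```

Calling build_model().learn(total_timesteps=...) then trains as usual. Keeping n_steps * n_envs a multiple of batch_size avoids a truncated final minibatch.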

araffin avatar Sep 11 '22 19:09 araffin

"Right now, if number of steps is less than length of episode then a rollout may not have the final reward, so it won't be discounted for any transition in the rollout, which would result in this crucial information missing"

This is the role of the value function: we bootstrap with the value function when the rollout ends in the middle of an episode (you can check the GAE code ;))
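To make the bootstrapping concrete, here is a minimal pure-Python sketch of GAE for a single environment. It is not SB3's actual implementation (which works on numpy buffers and tracks episode starts); the function name compute_gae and the convention that dones[t] marks an episode ending at step t are assumptions made for illustration.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative sketch).

    rewards[t], values[t]: reward and value estimate at step t.
    dones[t]: True if the episode ended at step t (assumed convention).
    last_value: value estimate of the state reached after the last step,
                used to bootstrap when the rollout stops mid-episode.
    """
    n = len(rewards)
    advantages = [0.0] * n
    last_gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        # No bootstrapping across an episode boundary.
        next_non_terminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Rollout cut off mid-episode: no reward seen yet, but the value function
# predicts 1.0 for the next state, so that estimate is discounted backwards.
adv, ret = compute_gae(
    rewards=[0.0, 0.0, 0.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, False],
    last_value=1.0,
    gamma=0.5,
    gae_lambda=1.0,
)
```

If the value estimate is accurate, the transitions in a truncated rollout still receive credit for the reward that arrives later in the episode, even though that reward is not in the rollout itself.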

araffin avatar Sep 11 '22 19:09 araffin