[Feature Request] Allow users to define gradient steps as a fraction of rollout time-steps
🚀 Feature
Currently, SB3 algorithms allow you to define gradient_steps $= -1$, which translates into the number of timesteps in the rollout; call it $k$.
However, allowing the user to define gradient_steps $= -2$, $-4$, or $-8$, translating respectively into $k/2$, $k/4$, or $k/8$ gradient steps, would be more beneficial. This would also allow users to write simpler hyperparameter tuning code.
Implementation is pretty simple: change `off_policy_algorithm.py` L344:
```python
gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else rollout.episode_timesteps
# should be changed to (integer division, so gradient_steps stays an int):
gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else rollout.episode_timesteps // -self.gradient_steps
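```
For illustration, here is a minimal, self-contained sketch of the proposed semantics. The helper name `resolve_gradient_steps` is hypothetical (not SB3 API), and the `max(..., 1)` guard is an assumption to avoid zero gradient steps on very short rollouts:
```python
# Hypothetical helper illustrating the proposed semantics (not part of SB3).
def resolve_gradient_steps(gradient_steps: int, rollout_timesteps: int) -> int:
    """Return the number of gradient steps to run for this rollout."""
    if gradient_steps >= 0:
        return gradient_steps
    # Negative values divide the rollout length: -1 -> k, -2 -> k // 2, ...
    # max(..., 1) is an assumption, guarding against zero steps on short rollouts.
    return max(rollout_timesteps // -gradient_steps, 1)

assert resolve_gradient_steps(3, 64) == 3     # non-negative: used as-is
assert resolve_gradient_steps(-1, 64) == 64   # current SB3 behavior
assert resolve_gradient_steps(-2, 64) == 32   # proposed: k / 2
assert resolve_gradient_steps(-8, 64) == 8    # proposed: k / 8
```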
Motivation
Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)
Pitch
This wouldn't break existing functionality, and the code change is pretty simple.
Alternatives
No response
Additional context
No response
Checklist
- [X] I have checked that there is no similar issue in the repo
- [ ] If I'm requesting a new feature, I have proposed alternatives
Hello, please don't forget about alternatives too (you ticked the box).
@araffin sorry, I couldn't think of alternatives. Apologies for the mistake. I unticked the checkbox.
> Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)
The code in the RL Zoo is only one line, so I'm not sure how much this would simplify things.
> However, allowing the user to define gradient_steps $= -2$, $-4$, or $-8$, translating respectively into $k/2$, $k/4$, or $k/8$ gradient steps, would be more beneficial.
I have to say, using `-2` to mean $k/2$ seems rather unintuitive (`-1` is a special case, as it is intentionally an invalid value).
Alternatives
As you pointed out:
```python
n_steps = train_freq * n_envs
sub_factor = 2
gradient_steps = max(n_steps // sub_factor, 1)
```
This seems like a pretty simple and flexible alternative.
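For completeness, a sketch of how this would plug into an SB3 constructor, assuming an SB3 version that supports multi-env training for off-policy algorithms; the concrete values of `train_freq`, `n_envs`, and `sub_factor` are illustrative tuning choices, not defaults:
```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

# Illustrative tuning choices, not SB3 defaults.
train_freq = 64
n_envs = 4
sub_factor = 2

n_steps = train_freq * n_envs
gradient_steps = max(n_steps // sub_factor, 1)  # 128 here

env = make_vec_env("Pendulum-v1", n_envs=n_envs)
model = SAC(
    "MlpPolicy",
    env,
    train_freq=train_freq,
    gradient_steps=gradient_steps,
)
```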
An alternative way to solve this, similar to scikit-learn's train_test_split, is to dispatch on the data type:
- `int` -> absolute number of gradient steps
- `float` -> fraction of the rollout timesteps
Though for sklearn, they don't have the issue of `2.0 * train_freq` and `2` gradient steps both being valid options.
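A minimal sketch of what such type-based dispatch could look like (the helper is hypothetical, not SB3 code); note how it runs straight into the ambiguity above, since `2` and `2.0` are both accepted but mean very different things:
```python
from typing import Union

# Hypothetical dispatcher, not SB3 code: int -> absolute count, float -> fraction.
def resolve_gradient_steps(gradient_steps: Union[int, float], rollout_timesteps: int) -> int:
    if isinstance(gradient_steps, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("gradient_steps must be an int or a float")
    if isinstance(gradient_steps, int):
        # Keep -1 as the existing special case: one gradient step per rollout timestep.
        return rollout_timesteps if gradient_steps == -1 else gradient_steps
    # Floats are read as a fraction (or multiple) of the rollout timesteps.
    return max(int(gradient_steps * rollout_timesteps), 1)

assert resolve_gradient_steps(2, 64) == 2       # int: exactly 2 gradient steps
assert resolve_gradient_steps(2.0, 64) == 128   # float: 2.0 * k -- the ambiguity
assert resolve_gradient_steps(0.5, 64) == 32    # float: k / 2
```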