stable-baselines3

[Feature Request] Allow users to define gradient steps as a fraction of rollout time-steps

Open janakact opened this issue 1 year ago • 1 comments

🚀 Feature

Currently, SB3 algorithms allow you to set gradient_steps $= -1$, which translates into one gradient step per timestep in the rollout; let's call that number $k$.

However, allowing the user to set gradient_steps $= -2, -4$ or $-8$, translating into $k/2, k/4, k/8$ gradient steps respectively, would be more useful. It would also let users write simpler hyperparameter-tuning code.

Implementation is pretty simple: change off_policy_algorithm.py line 344:

gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else rollout.episode_timesteps
# should be changed to (integer division keeps the result an int, with at least one step):
gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else max(rollout.episode_timesteps // -self.gradient_steps, 1)
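To illustrate the proposed semantics in isolation, here is a small standalone sketch (the helper name resolve_gradient_steps is hypothetical, not part of SB3):

```python
def resolve_gradient_steps(gradient_steps: int, rollout_timesteps: int) -> int:
    """Resolve the effective number of gradient steps.

    Non-negative values are taken literally; a negative value -n means
    rollout_timesteps / n (at least 1), so -1 keeps the current behaviour.
    """
    if gradient_steps >= 0:
        return gradient_steps
    return max(rollout_timesteps // -gradient_steps, 1)

print(resolve_gradient_steps(10, 64))  # explicit value: 10
print(resolve_gradient_steps(-1, 64))  # one step per rollout timestep: 64
print(resolve_gradient_steps(-4, 64))  # a quarter of the rollout: 16
```

The max(..., 1) guard keeps short rollouts from resolving to zero gradient steps.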

Motivation

Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency (see the rl-zoo code).

Pitch

This wouldn't break existing functionality, and the code change is small.

Alternatives

No response

Additional context

No response

Checklist

  • [X] I have checked that there is no similar issue in the repo
  • [X] If I'm requesting a new feature, I have proposed alternatives

janakact avatar May 06 '24 06:05 janakact

hello, please don't forget about alternatives too (you ticked the box).

araffin avatar May 06 '24 06:05 araffin

@araffin sorry, I couldn't think of alternatives. Apologies for the mistake. I unticked the checkbox.

janakact avatar May 23 '24 08:05 janakact

Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)

The code in the RL Zoo is only one line, so I'm not sure how much this would simplify things.

However, allowing the user to set gradient_steps $= -2, -4$ or $-8$, translating into $k/2, k/4, k/8$ gradient steps respectively

I have to say using -2 to mean /2 seems rather unintuitive (-1 is a special case as it is intentionally an invalid value).

Alternatives

As you pointed out:

n_steps = train_freq * n_envs  # timesteps collected per rollout
sub_factor = 2                 # take 1/sub_factor of the rollout as gradient steps
gradient_steps = max(n_steps // sub_factor, 1)

seems a pretty simple and flexible alternative

araffin avatar May 23 '24 09:05 araffin

An alternative way to solve this, similar to scikit-learn's train_test_split, is to switch behaviour on the argument's type:

  • int -> number of steps
  • float -> fraction of the data

Though for sklearn, they don't have the issue of 2.0 * train_freq and 2 gradient steps both being valid options.
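A minimal sketch of that type-based dispatch (the name parse_gradient_steps is hypothetical; SB3 does not implement this):

```python
def parse_gradient_steps(gradient_steps, rollout_timesteps: int) -> int:
    """int -> absolute number of gradient steps; float -> fraction of the rollout."""
    if isinstance(gradient_steps, float):
        # Fraction of the collected timesteps, e.g. 0.5 -> half the rollout
        return max(int(gradient_steps * rollout_timesteps), 1)
    # int: keep current SB3 behaviour, with -1 meaning "one step per timestep"
    return gradient_steps if gradient_steps >= 0 else rollout_timesteps

print(parse_gradient_steps(0.5, 64))  # 32
print(parse_gradient_steps(2.0, 64))  # 128: floats above 1.0 stay valid
print(parse_gradient_steps(2, 64))    # literally 2 gradient steps
```

This keeps 2 and 2.0 both valid but with different meanings, which is exactly the ambiguity noted above.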

swierh avatar Jun 19 '24 14:06 swierh