
[Enhancement] Support copying optuna params dict for all hyperparameters

Open jkterry1 opened this issue 3 years ago • 4 comments

Right now, only hyperparameters that are searched by default can have their params dict copied and reused, due to naming issues. This should be extended to hyperparameters that are not searched by default, per the discussion in issue #115.

jkterry1 commented on Jun 21, 2021

only hyperparameters that are searched by default can have their params dict copied and reused due to naming issues

Well, some params that are searched also cannot be copied.

araffin commented on Jun 23, 2021
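For context, the naming issue looks roughly like this: the sampled params dict contains sampler-level keys such as 'net_arch' and 'activation_fn' (plain strings) that have to be translated into real constructor arguments before they can be reused. A minimal sketch of such a conversion, where the helper name and the exact mapping are illustrative (loosely following the zoo's PPO sampler) rather than an existing zoo API:

  import torch as th

  def ppo_kwargs_from_trial_params(params: dict) -> dict:
      """Hypothetical helper: turn an Optuna trial params dict into PPO kwargs."""
      params = dict(params)  # copy, so sampler-only keys can be popped
      # These string choices only make sense to the sampler, not to the PPO constructor.
      net_arch = {"small": [64, 64], "medium": [256, 256]}[params.pop("net_arch")]
      activation_fn = {"tanh": th.nn.Tanh, "relu": th.nn.ReLU}[params.pop("activation_fn")]
      params["policy_kwargs"] = dict(net_arch=net_arch, activation_fn=activation_fn)
      return params

  # Example with the trial reported later in this thread:
  trial_params = {
      "batch_size": 256, "n_steps": 32, "gamma": 0.999,
      "learning_rate": 0.00043216809397908225, "ent_coef": 5.844122887301502e-07,
      "clip_range": 0.2, "n_epochs": 10, "gae_lambda": 0.92, "max_grad_norm": 2,
      "vf_coef": 0.035882158772375855, "net_arch": "medium", "activation_fn": "relu",
  }
  ppo_kwargs = ppo_kwargs_from_trial_params(trial_params)
  # PPO("MlpPolicy", env, **ppo_kwargs) now receives real constructor arguments.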

[Related question] Transferring hyperparameters from Optuna

For learning purposes, I am tuning a number of algorithms for the environment 'MountainCar-v0'. At the moment I am interested in PPO, and I intend to share working tuned hyperparameters by putting them on your repo. I am trying to understand, hands-on and in some depth, how a variety of algorithms work, and SB3 and the zoo are great tools for that. So I was using Optuna from the zoo to find the right parameters for PPO, and judging by the results it produced, I would say the hyperparameters should work:

I ran the command as indicated: train.py --algo ppo --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler tpe --pruner median

Output:

  ========== MountainCar-v0 ==========
  Seed: 2520733740
  Default hyperparameters for environment (ones being tuned will be overridden):
  OrderedDict([('ent_coef', 0.0),
               ('gae_lambda', 0.98),
               ('gamma', 0.99),
               ('n_envs', 16),
               ('n_epochs', 4),
               ('n_steps', 16),
               ('n_timesteps', 1000000.0),
               ('normalize', True),
               ('policy', 'MlpPolicy')])
  Using 16 environments
  Overwriting n_timesteps with n=50000
  Normalization activated: {'gamma': 0.99}
  Optimizing hyperparameters
  Sampler: tpe - Pruner: median

Then one nice result is:

  Trial 151 finished with value: -95.4 and parameters: {'batch_size': 256, 'n_steps': 32, 'gamma': 0.999, 'learning_rate': 0.00043216809397908225, 'ent_coef': 5.844122887301502e-07, 'clip_range': 0.2, 'n_epochs': 10, 'gae_lambda': 0.92, 'max_grad_norm': 2, 'vf_coef': 0.035882158772375855, 'net_arch': 'medium', 'activation_fn': 'relu'}. Best is trial 151 with value: -95.4.
  Normalization activated: {'gamma': 0.99}
  Normalization activated: {'gamma': 0.99, 'norm_reward': False}
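As an aside, if the study is persisted (the zoo's train.py exposes --storage and --study-name options for this; worth double-checking in your version), a trial like this can later be pulled back out with plain Optuna instead of being copied from the console:

  import optuna

  # Assumes train.py was run with something like:
  #   --storage sqlite:///ppo_mountaincar.db --study-name ppo-mountaincar
  study = optuna.load_study(
      study_name="ppo-mountaincar",
      storage="sqlite:///ppo_mountaincar.db",
  )
  print(study.best_trial.value)   # e.g. -95.4
  print(study.best_trial.params)  # the raw params dict, which still needs conversion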

The environment is considered solved at a reward of -110, according to the literature.

When I pass these hyperparameters to the algorithm, it does not work (the reward remains at -200). I do not exactly understand why.

import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

envm = make_vec_env("MountainCar-v0", n_envs=16)
policy_kwargs = dict(activation_fn=th.nn.ReLU, net_arch=[dict(pi=[254, 254], vf=[254, 254])])
model = PPO("MlpPolicy", envm, verbose=1, batch_size=256, n_steps=2048, gamma=0.9999,
            learning_rate=0.00043216809397908225, ent_coef=5.844122887301502e-07, clip_range=0.2, n_epochs=10,
            gae_lambda=0.92, max_grad_norm=2, vf_coef=0.035882158772375855, policy_kwargs=policy_kwargs)

model.learn(total_timesteps=1000000)
model.save("ppo_mountaincar")

As I read the docs, it is supposed to work like that; am I wrong? Should I take something else into account?

IlonaAT commented on Sep 15, 2021

When I pass these hyperparameters to the algorithm, it does not work (the reward remains at -200). I do not exactly understand why.

You are missing the normalization wrapper: envm = VecNormalize(envm, gamma=0.9999)

Note that results may also depend on the random seed (cf. the docs and issue https://github.com/DLR-RM/rl-baselines3-zoo/issues/151).

araffin commented on Sep 15, 2021
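For completeness, a sketch of the snippet above with the normalization wrapper added (hyperparameter values kept from the original snippet, except net_arch set to 256 to match the zoo's 'medium' setting; the seed argument and the VecNormalize save at the end are optional extras):

  import torch as th
  from stable_baselines3 import PPO
  from stable_baselines3.common.env_util import make_vec_env
  from stable_baselines3.common.vec_env import VecNormalize

  envm = make_vec_env("MountainCar-v0", n_envs=16)
  # The missing piece: normalize observations (and rewards during training),
  # matching the "Normalization activated" lines in the tuning output.
  envm = VecNormalize(envm, gamma=0.9999)

  # Note: newer SB3 versions expect net_arch=dict(pi=[...], vf=[...]) without the list.
  policy_kwargs = dict(activation_fn=th.nn.ReLU, net_arch=[dict(pi=[256, 256], vf=[256, 256])])
  model = PPO(
      "MlpPolicy", envm, verbose=1, seed=0,  # fixing the seed helps reproducibility
      batch_size=256, n_steps=2048, gamma=0.9999,  # trial 151 actually sampled n_steps=32 and gamma=0.999
      learning_rate=0.00043216809397908225, ent_coef=5.844122887301502e-07,
      clip_range=0.2, n_epochs=10, gae_lambda=0.92, max_grad_norm=2,
      vf_coef=0.035882158772375855, policy_kwargs=policy_kwargs,
  )
  model.learn(total_timesteps=1000000)
  model.save("ppo_mountaincar")
  # Save the normalization statistics too; they are needed again at evaluation time.
  envm.save("vecnormalize_mountaincar.pkl")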

Thank you!

IlonaAT commented on Sep 15, 2021