rl-baselines3-zoo
[Enhancement] Support copying optuna params dict for all hyperparameters
Right now, only hyperparameters that are searched by default can have their params dict copied and reused, due to naming issues. This should be extended to hyperparameters that are not searched by default, per the discussion in issue #115.
only hyperparameters that are searched by default can have their params dict copied and reused due to naming issues
Well, some params that are searched cannot be copied either.
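To illustrate the naming issue, here is a minimal sketch of the manual translation this enhancement would automate. The params dict that Optuna reports contains sampler labels such as 'net_arch' and 'activation_fn' as plain strings, which have to be mapped back to objects PPO understands before the dict can be reused. The mappings below are assumptions for illustration only; the real ones live in the zoo's hyperparameter sampling code (e.g. sample_ppo_params()).

```python
import torch as th

# Params dict as reported by Optuna for a finished trial (values copied from the log below).
trial_params = {
    "batch_size": 256, "n_steps": 32, "gamma": 0.999,
    "learning_rate": 0.00043216809397908225, "ent_coef": 5.844122887301502e-07,
    "clip_range": 0.2, "n_epochs": 10, "gae_lambda": 0.92,
    "max_grad_norm": 2, "vf_coef": 0.035882158772375855,
    "net_arch": "medium", "activation_fn": "relu",
}

# 'net_arch' and 'activation_fn' are sampler labels, not objects the PPO constructor
# accepts, so they must be translated by hand. These mappings are assumed here.
net_arch_map = {"small": [64, 64], "medium": [256, 256]}
activation_fn_map = {"tanh": th.nn.Tanh, "relu": th.nn.ReLU}

ppo_kwargs = dict(trial_params)  # work on a copy, keep the reported dict intact
ppo_kwargs["policy_kwargs"] = dict(
    net_arch=net_arch_map[ppo_kwargs.pop("net_arch")],
    activation_fn=activation_fn_map[ppo_kwargs.pop("activation_fn")],
)
# ppo_kwargs can now be passed as PPO("MlpPolicy", env, **ppo_kwargs)
```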
[related question] Transfer hyperparameters from optuna
For learning purposes I am tuning a number of algorithms for the environment 'MountainCar-v0'. At the moment I am interested in PPO, and I intend to share working tuned hyperparameters by putting them on your repo. I am trying to understand, hands-on and with some depth, how a variety of algorithms work, and SB3 and the zoo are great tools for that. So I was using Optuna from the zoo to find the right parameters for PPO, and judging by the results it produced, the hyperparameters should work:
I execute as indicated:
train.py --algo ppo --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler tpe --pruner median
Output:
========== MountainCar-v0 ==========
Seed: 2520733740
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 0.0),
('gae_lambda', 0.98),
('gamma', 0.99),
('n_envs', 16),
('n_epochs', 4),
('n_steps', 16),
('n_timesteps', 1000000.0),
('normalize', True),
('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=50000
Normalization activated: {'gamma': 0.99}
Optimizing hyperparameters
Sampler: tpe - Pruner: median
Then one nice result is:
Trial 151 finished with value: -95.4 and parameters: {'batch_size': 256, 'n_steps': 32, 'gamma': 0.999, 'learning_rate': 0.00043216809397908225, 'ent_coef': 5.844122887301502e-07, 'clip_range': 0.2, 'n_epochs': 10, 'gae_lambda': 0.92, 'max_grad_norm': 2, 'vf_coef': 0.035882158772375855, 'net_arch': 'medium', 'activation_fn': 'relu'}. Best is trial 151 with value: -95.4.
Normalization activated: {'gamma': 0.99}
Normalization activated: {'gamma': 0.99, 'norm_reward': False}
According to the literature, the environment is considered solved at a reward of -110.
When I pass these hyperparameters to the algorithm, it does not work (the reward stays at -200), and I do not quite understand why.
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

envm = make_vec_env("MountainCar-v0", n_envs=16)
policy_kwargs = dict(activation_fn=th.nn.ReLU, net_arch=[dict(pi=[254, 254], vf=[254, 254])])
model = PPO("MlpPolicy", envm, verbose=1, batch_size=256, n_steps=2048, gamma=0.9999, learning_rate=0.00043216809397908225, ent_coef=5.844122887301502e-07, clip_range=0.2, n_epochs=10, gae_lambda=0.92, max_grad_norm=2, vf_coef=0.035882158772375855, policy_kwargs=policy_kwargs)
model.learn(total_timesteps=1000000)
model.save("ppo_mountaincar")
From my reading of the docs, I would say it is supposed to work like that. Am I wrong? Should I take something else into account?
When I pass these hyperparameters to the algorithm, it does not work (the reward stays at -200), and I do not quite understand why.
You are missing the normalization wrapper: envm = VecNormalize(envm, gamma=0.9999)
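For reference, here is a minimal end-to-end sketch with the wrapper added, reusing the values from Trial 151 above. The net_arch for the sampled 'medium' label is assumed to be [256, 256]; check the zoo's sampling code for the actual mapping.

```python
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

envm = make_vec_env("MountainCar-v0", n_envs=16)
# The missing piece: normalize observations (and rewards during training),
# as the zoo does when 'normalize: True' is set for the environment.
# The gamma passed here should match the agent's discount factor.
envm = VecNormalize(envm, gamma=0.999)

policy_kwargs = dict(
    activation_fn=th.nn.ReLU,
    net_arch=[256, 256],  # assumed mapping for the sampled 'medium' label
)

model = PPO(
    "MlpPolicy",
    envm,
    verbose=1,
    batch_size=256,
    n_steps=32,
    gamma=0.999,
    learning_rate=0.00043216809397908225,
    ent_coef=5.844122887301502e-07,
    clip_range=0.2,
    n_epochs=10,
    gae_lambda=0.92,
    max_grad_norm=2,
    vf_coef=0.035882158772375855,
    policy_kwargs=policy_kwargs,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo_mountaincar")
# Also save the normalization statistics, otherwise evaluation will see unnormalized observations.
envm.save("vec_normalize.pkl")
```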
Note that results may also depend on the random seed (cf. the docs and issue https://github.com/DLR-RM/rl-baselines3-zoo/issues/151).
Thank you!