rl-baselines3-zoo
[Question] Results vastly different for an agent created with Stable Baselines3 using hyperparameters optimized in RL Baselines3 Zoo.
❓ Question
Hello. I first ran hyperparameter optimization for A2C over 1 million steps using RL Baselines3 Zoo.
To do so, I changed a2c.yml in RL Baselines3 Zoo to work with the RAM version of Seaquest:
```yaml
atari:
  policy: 'MlpPolicy'
  n_envs: 16
  policy_kwargs: "dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5))"
```
Then I ran the following command:
```
python -m train --algo a2c --env ALE/Seaquest-ram-v5 -n 1000000 -optimize --n-trials 100 --n-startup-trials 10 --sampler tpe --pruner median --n-evaluations 4 --n-eval-envs 16 --storage "some_valid_database" --study-name test
```
Top 3 results:
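For completeness, the same top trials can be read back from the Optuna storage. Here is a minimal sketch, assuming a SQLite storage URL such as `sqlite:///a2c.db` (a hypothetical stand-in; the real storage URL is elided above):

```python
import optuna

# Load the finished study; the study name matches --study-name above,
# the storage URL is a hypothetical stand-in for "some_valid_database".
study = optuna.load_study(study_name="test", storage="sqlite:///a2c.db")

# Keep only trials that finished with a value, sort descending, print the top 3.
completed = [t for t in study.trials if t.value is not None]
for trial in sorted(completed, key=lambda t: t.value, reverse=True)[:3]:
    print(trial.number, trial.value, trial.params)
```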
Then I took, for example, this set of hyperparameters and used the following code:
```python
import torch
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike


def linear_decay_lr(progress_remaining):
    # Linear schedule: starts at ~2.72e-4 and decays to 0 over training.
    return 0.00027232300584036946 * progress_remaining


if __name__ == "__main__":
    vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16)
    model = A2C(
        "MlpPolicy",
        vec_env,
        learning_rate=linear_decay_lr,
        n_steps=256,
        gamma=0.999,
        gae_lambda=0.98,
        ent_coef=0.00001753537605091099,
        vf_coef=0.19195701505334234,
        max_grad_norm=0.5,
        use_rms_prop=True,
        normalize_advantage=False,
        verbose=1,
        tensorboard_log="./seaquest/107",
        policy_kwargs=dict(
            activation_fn=torch.nn.Tanh,
            net_arch=dict(pi=[256, 256], vf=[256, 256]),
            ortho_init=True,
            optimizer_class=RMSpropTFLike,
            optimizer_kwargs=dict(eps=1e-5),
        ),
    )
    model.learn(total_timesteps=1000000, log_interval=1)
```
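To compare against the zoo's reported score, the trained model can also be evaluated explicitly after training. Below is a minimal sketch using SB3's standard `evaluate_policy` helper (`n_eval_episodes=10` and `deterministic=True` are my assumptions, not values taken from the zoo run):

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Sketch: mean episodic reward over a fixed number of evaluation episodes
# (the episode count is an arbitrary choice for illustration).
mean_reward, std_reward = evaluate_policy(
    model, vec_env, n_eval_episodes=10, deterministic=True
)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```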
I get the following results:
As the picture shows, the result is a long way from the 456 that RL Baselines3 Zoo reached. I have tried other sets of hyperparameters, but the scores are always much lower. One factor I am aware of that can affect this is the seed, as I did not use the same one. Nevertheless, I have trained many A2C instances and the problem remains.
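One thing that could still be pinned down is the seed. A minimal sketch of fixing it explicitly (the seed value 42 is arbitrary, and the remaining A2C hyperparameters from the script above would be passed unchanged):

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed

set_random_seed(42)  # seeds Python's random, NumPy, and PyTorch; 42 is arbitrary
vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16, seed=42)
# Same hyperparameters as in the script above, plus an explicit seed:
model = A2C("MlpPolicy", vec_env, seed=42, verbose=1)
```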
Checklist
- [X] I have checked that there is no similar issue in the repo
- [X] I have read the SB3 documentation
- [X] I have read the RL Zoo documentation
- [X] If code there is, it is minimal and working
- [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.