stable-baselines
[Question] GAIL generator batch size
Hello. How should the GAIL generator batch size here scale with, say, discriminator batch size here or with `self.timesteps_per_batch`? Is there a reason why it is fixed, or is it dependent on any other parameter? Thanks.
P.S. Yes, I have been directed to another codebase for GAIL many times, but I just want to run some quick experiments :P
Hello,
> Is there a reason why it is fixed, or is it dependent on any other parameter?

This is legacy code from OpenAI Baselines... so probably only @andrewliao11 knows... and the fixed batch size is apparently only used for the value function.
Thanks. While I wait, I also have a question on PPO2 (don't want to open another issue):

I see there are `self.n_envs` environments running in parallel. I understand this might not make sense, but is it possible to get some kind of "episode number" value from, say, the rollout here, or any other parameter? Most algorithms have an episode count, and it would be nice to have a common axis (other than `total_timesteps`) to plot stuff with.
> Most algorithms have an episode count, and it would be nice to have a common axis (other than `total_timesteps`) to plot stuff with.
If you use a `Monitor` wrapper (included in `make_vec_env` or the RL Zoo), then you can plot the results by episodes (cf. the doc on the results plotter). You have a concrete example here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/scripts/plot_train.py#L33 (with SB3, replace `X_TIMESTEPS` by `X_EPISODES`).
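A minimal sketch of that setup with SB2 (the env, log directory, and timestep budget are placeholder choices, not taken from the thread):

```python
import os

import gym
import matplotlib.pyplot as plt

from stable_baselines import PPO2, results_plotter
from stable_baselines.bench import Monitor

log_dir = "./logs/"  # placeholder path
os.makedirs(log_dir, exist_ok=True)

# Monitor records per-episode reward and length to a monitor.csv in log_dir
env = Monitor(gym.make("CartPole-v1"), log_dir)

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# Plot the training reward with episodes (instead of timesteps) on the x-axis
results_plotter.plot_results([log_dir], 100000, results_plotter.X_EPISODES, "PPO2 CartPole")
plt.show()
```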
Anyway, as mentioned in the doc, I would recommend using an `EvalCallback` instead of the training reward for plotting the learning curve.
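For reference, a minimal sketch of that `EvalCallback` setup with SB2 (the env, paths, and frequencies below are placeholder values, not taken from the thread):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.callbacks import EvalCallback

# A separate environment used only for periodic evaluation
eval_env = gym.make("CartPole-v1")  # placeholder env

# Every eval_freq training steps, run n_eval_episodes deterministic episodes,
# log the results to log_path, and keep the best model seen so far
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    log_path="./eval_logs/",
    eval_freq=5000,
    n_eval_episodes=5,
    deterministic=True,
)

model = PPO2("MlpPolicy", gym.make("CartPole-v1"), verbose=1)
model.learn(total_timesteps=100000, callback=eval_callback)

# The evaluation history ends up in ./eval_logs/evaluations.npz, and the best
# model (by mean evaluation reward) can be reloaded afterwards:
best_model = PPO2.load("./best_model/best_model")
```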
Thanks!
> Anyway, as mentioned in the doc, I would recommend using an `EvalCallback` instead of the training reward for plotting the learning curve.
Ohh. Could you expand more on this? I am using `EvalCallback` during training to check performance and save the best model (1), instead of saving the last available model with `model.save()` (2). Does that make sense? Can I call (1) my trained model?

I ask because the `model.learn()` `mean_reward` logs are significantly different from the `EvalCallback` `episode_reward` logs. I am not sure which is the 'correct' reward value.
E.g. running `TRPO` on Hopper-v2, the `model.learn()` `mean_reward` says 3000, whereas the `EvalCallback` `episode_reward` best value is 3600. Unfortunately, the TensorBoard plots only show the former :\

FYI, I have set the seed, set `n_cpu_tf_sess=1` and `deterministic=True` where needed, so there is no randomness. I could post this as a separate issue if needed.
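One way to sanity-check the two numbers (a sketch, not part of the original exchange; the checkpoint path is a placeholder) is to run a deterministic evaluation manually with `evaluate_policy`, which is what `EvalCallback` uses internally, and compare it with the rollout `mean_reward` that `model.learn()` logs. The working assumption here is that the gap comes from comparing stochastic training rollouts against deterministic evaluation episodes:

```python
import gym

from stable_baselines import TRPO
from stable_baselines.common.evaluation import evaluate_policy

eval_env = gym.make("Hopper-v2")

# Load whichever checkpoint you want to inspect (placeholder path)
model = TRPO.load("./best_model/best_model")

# Deterministic evaluation over full episodes, as EvalCallback does internally.
# The mean_reward printed during model.learn() comes from the rollouts collected
# with the (stochastic) training policy, so the two numbers need not match.
mean_reward, std_reward = evaluate_policy(model, eval_env,
                                          n_eval_episodes=10,
                                          deterministic=True)
print("mean_reward = {:.1f} +/- {:.1f}".format(mean_reward, std_reward))
```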
> Could you expand more on this? I am using `EvalCallback` during training to check performance and save the best model
This is unrelated to the issue; please read the doc and take a look at the RL Zoo.
> E.g. running `TRPO` on Hopper-v2, the `model.learn()` `mean_reward` says 3000, whereas the `EvalCallback` `episode_reward` best value is 3600.
Duplicate of https://github.com/hill-a/stable-baselines/issues/581#issuecomment-557984749