
[Question] GAIL generator batch size

Open prabhasak opened this issue 4 years ago • 5 comments

Hello. How should the GAIL generator batch size here scale with, say, the discriminator batch size here or with self.timesteps_per_batch? Is there a reason why it is fixed, or does it depend on any other parameter? Thanks.
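For context, a minimal sketch (not from the original question) of the batch-size-related knobs that GAIL in stable-baselines v2 does expose; the generator batch size asked about is an internal constant, not a constructor argument. The expert file path, env id, and values below are placeholders, and it is assumed that extra kwargs such as timesteps_per_batch are forwarded to the underlying TRPO:

```python
from stable_baselines import GAIL
from stable_baselines.gail import ExpertDataset

# batch_size controls how many expert samples are drawn per discriminator update
dataset = ExpertDataset(expert_path="expert_hopper.npz", traj_limitation=10,
                        batch_size=128, verbose=1)

model = GAIL(
    "MlpPolicy", "Hopper-v2", dataset,
    g_step=3,                  # generator (TRPO) steps per discriminator step
    d_step=1,                  # discriminator steps per iteration
    d_stepsize=3e-4,           # discriminator learning rate
    timesteps_per_batch=1024,  # generator rollout length (assumed forwarded to TRPO)
    verbose=1,
)
model.learn(total_timesteps=int(1e6))
```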

P.S. Yes, I have been directed to another codebase for GAIL many times, but I just want to run some quick experiments :P

prabhasak avatar Sep 22 '20 10:09 prabhasak

Hello,

Is there a reason why it is fixed, or is it dependent on any other parameter?

This is legacy code from OpenAI Baselines... so probably only @andrewliao11 knows... and the fixed batch size is apparently used only for the value function.

araffin avatar Sep 22 '20 10:09 araffin

Thanks. While I wait, I also have a question on PPO2 (don't want to open another issue):

I see there are self.n_envs environments running in parallel. I understand this might not make sense, but is it possible to get some kind of "episode number" value from, say, the rollout here, or any other parameter? Most algorithms have an episode count, and it would be nice to have a common axis (other than total_timesteps) to plot stuff with.

prabhasak avatar Sep 22 '20 21:09 prabhasak

Most algorithms have an episode count, and it would be nice to have a common axis (other than total_timesteps) to plot stuff with.

If you use a Monitor wrapper (included in make_vec_env or the RL Zoo), then you can plot the results by episode (cf. the doc on the results plotter). There is a concrete example here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/scripts/plot_train.py#L33 (with SB3, replace X_TIMESTEPS by X_EPISODES).
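For illustration, a minimal sketch of that setup, assuming stable-baselines v2 (the env id, log directory, and timestep budget are placeholders):

```python
import os
import matplotlib.pyplot as plt

from stable_baselines import PPO2, results_plotter
from stable_baselines.common import make_vec_env

log_dir = "./ppo2_monitor_logs/"
os.makedirs(log_dir, exist_ok=True)

# Each of the n_envs sub-environments gets its own Monitor file in log_dir
env = make_vec_env("CartPole-v1", n_envs=4, monitor_dir=log_dir)

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# X_EPISODES puts the episode index on the x-axis instead of timesteps
results_plotter.plot_results([log_dir], 100000,
                             results_plotter.X_EPISODES, "PPO2 CartPole")
plt.show()
```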

Anyway, as mentioned in the doc, I would recommend using an EvalCallback instead of the training reward for plotting the learning curve.
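A minimal sketch of that EvalCallback setup, again with placeholder paths, env id, and frequencies:

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.callbacks import EvalCallback

eval_env = gym.make("CartPole-v1")  # separate env used only for evaluation

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model/",  # best model saved here
    log_path="./logs/eval/",                    # evaluations.npz written here
    eval_freq=10000,                            # evaluate every 10k steps
    n_eval_episodes=5,
    deterministic=True,
)

model = PPO2("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100000, callback=eval_callback)
```

EvalCallback periodically runs the current policy on the separate evaluation env, logs the evaluation results, and keeps the best model seen so far.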

araffin avatar Sep 23 '20 08:09 araffin

Thanks!

Anyway, as mentioned in the doc, I would recommend using an EvalCallback instead of the training reward for plotting the learning curve

Ohh. Could you expand more on this? I am using EvalCallback during training to check performance and save the best model (1), instead of saving the last available model with model.save() (2). Does that make sense? Can I call (1) my trained model?

I ask because the model.learn() mean_reward logs are significantly different from the EvalCallback episode_reward logs. I am not sure which is the 'correct' reward value.

E.g., running TRPO on Hopper-v2 gives a model.learn() mean_reward of about 3000, whereas the best EvalCallback episode_reward is 3600. Unfortunately, the TensorBoard plots only show the former :\

FYI, I have set the seed, n_cpu_tf_sess=1, and deterministic=True where needed, so there is no randomness. I could post this as a separate issue if needed.

prabhasak avatar Sep 23 '20 09:09 prabhasak

Could you expand more on this? I am using EvalCallback during training to check performance and save the best model

This is unrelated to the issue; please read the doc and take a look at the RL Zoo.

E.g., running TRPO on Hopper-v2 gives a model.learn() mean_reward of about 3000, whereas the best EvalCallback episode_reward is 3600.

Duplicate of https://github.com/hill-a/stable-baselines/issues/581#issuecomment-557984749
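In short: the mean_reward logged during model.learn() is computed from rollouts collected with the stochastic training policy, while EvalCallback runs separate, by-default deterministic evaluation episodes, so the two numbers are not expected to match. As a sketch (hypothetical paths, assuming the saved best model from the EvalCallback above), re-evaluating the best model deterministically gives a value comparable to the EvalCallback episode_reward:

```python
import gym

from stable_baselines import TRPO
from stable_baselines.common.evaluation import evaluate_policy

# Load the best model saved by EvalCallback and evaluate it deterministically
model = TRPO.load("./logs/best_model/best_model.zip")
eval_env = gym.make("Hopper-v2")

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10,
                                          deterministic=True)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")
```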

araffin avatar Sep 23 '20 09:09 araffin