Vincent Zhuang
With k=5 and n=100, MAML fails to learn: average training and validation returns consistently hover around 50 throughout all 500 outer-loop steps. Any possible discrepancies between this repo's code/config and...
Averaged results over 10 runs for PPO on Walker2d-v3:
[Figure: walker2dv3normtest]
For MuJoCo envs, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without it, performance is noticeably...
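A minimal sketch of this kind of reward scaling, assuming a vectorized setting with `num_envs` parallel environments and NumPy arrays for rewards and done flags. It follows the VecNormalize variant, which divides each reward by the running standard deviation of the discounted return (no mean subtraction) and then clips; rllab's NormalizedEnv instead keeps a moving estimate over the per-step reward itself. `RunningMeanStd` mirrors the parallel mean/variance update used in baselines; `RewardScaler` is a hypothetical name, and the `gamma`, `clip`, and `eps` defaults are assumptions.

```python
import numpy as np


class RunningMeanStd:
    """Running mean/variance using the parallel (batched) update rule."""

    def __init__(self, epsilon: float = 1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # tiny prior count so the first update is well defined

    def update(self, x: np.ndarray) -> None:
        batch_mean = float(np.mean(x))
        batch_var = float(np.var(x))
        batch_count = x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Merge the stored moments with the batch moments.
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean = self.mean + delta * batch_count / total
        self.var = m2 / total
        self.count = total


class RewardScaler:
    """Hypothetical helper: scale rewards by the running std of the discounted return."""

    def __init__(self, num_envs: int, gamma: float = 0.99,
                 clip: float = 10.0, eps: float = 1e-8):
        self.ret = np.zeros(num_envs)  # per-env discounted return accumulator
        self.rms = RunningMeanStd()
        self.gamma = gamma
        self.clip = clip
        self.eps = eps

    def __call__(self, rewards: np.ndarray, dones: np.ndarray) -> np.ndarray:
        self.ret = self.ret * self.gamma + rewards
        self.rms.update(self.ret)
        scaled = np.clip(rewards / np.sqrt(self.rms.var + self.eps),
                         -self.clip, self.clip)
        self.ret[dones] = 0.0  # restart the return at episode boundaries
        return scaled
```

The scaler would be called once per environment step on the raw rewards before they enter the PPO buffer, while the unscaled rewards are kept for logging returns.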
As per Jaksch et al. (2010), the confidence intervals for UCRL2 use t_k, defined as the timestep at the start of episode k. However, in `run_finite_tabular_experiment` in `experiment.py`, the episode index...
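A minimal sketch of the UCRL2 confidence widths, assuming `n_sa` is the visit count N_k(s, a) at the start of episode k, S and A are the numbers of states and actions, and δ is the confidence parameter. Per Jaksch et al. (2010), the reward half-width is sqrt(7 log(2SA t_k / δ) / (2 max{1, N_k(s, a)})) and the L1 radius for transitions is sqrt(14 S log(2A t_k / δ) / max{1, N_k(s, a)}). The function names and the fixed-length episode bookkeeping are hypothetical, not taken from `experiment.py`; the point is that t_k is a global step count, not an episode counter.

```python
import math
from typing import List, Tuple


def ucrl2_widths(n_sa: int, t_k: int, num_states: int,
                 num_actions: int, delta: float) -> Tuple[float, float]:
    """Return (reward_width, transition_l1_width) for one (s, a) in episode k.

    t_k must be the global timestep at which episode k started.
    """
    n = max(1, n_sa)  # unvisited pairs are treated as visited once
    reward_width = math.sqrt(
        7 * math.log(2 * num_states * num_actions * t_k / delta) / (2 * n))
    transition_width = math.sqrt(
        14 * num_states * math.log(2 * num_actions * t_k / delta) / n)
    return reward_width, transition_width


def episode_start_times(episode_lengths: List[int]) -> List[int]:
    """Hypothetical bookkeeping: t_k for each episode, with t starting at 1."""
    starts = []
    t = 1
    for length in episode_lengths:
        starts.append(t)  # t_k is recorded before the episode's steps
        t += length
    return starts
```

With three 10-step episodes, `episode_start_times([10, 10, 10])` returns `[1, 11, 21]`; passing the episode index instead would feed much smaller values of t_k into the logarithm and shrink both intervals.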