Vincent Zhuang issues

Results: 4 issues by Vincent Zhuang

Fails to converge on bandit tasks

Using k=5, n=100, MAML fails to learn: average training and validation returns consistently hover around 50 throughout all 500 outer loop steps. Any possible discrepancies between this repo's code/config and...

Normalize rewards by standard deviation of discounted return in MuJoCo

Averaged results over 10 runs for PPO on Walker2d-v3: ![walker2dv3normtest](https://user-images.githubusercontent.com/10367284/79826905-ca6dbb00-8351-11ea-8a24-efcafad53fa7.png)

Normalizing environment wrapper

For MuJoCo envs, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without it, performance is noticeably...
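As a rough illustration of the normalization described in this issue, the sketch below scales each reward by a running estimate of the standard deviation of the discounted return. It is not the code from the issue or from baselines/rllab: the names `RunningVariance` and `RewardNormalizer`, and the defaults `gamma=0.99`, `clip=10.0`, and `epsilon=1e-8`, are assumptions chosen for the example.

```python
import math


class RunningVariance:
    """Welford's online algorithm for the variance of a stream of scalars."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Fall back to unit variance until at least two samples are seen.
        if self.count < 2:
            return 1.0
        return self.m2 / self.count


class RewardNormalizer:
    """Hypothetical helper: divide rewards by the running std of the discounted return."""

    def __init__(self, gamma=0.99, clip=10.0, epsilon=1e-8):
        self.gamma = gamma      # discount factor (assumed default)
        self.clip = clip        # symmetric clipping range (assumed default)
        self.epsilon = epsilon  # avoids division by zero
        self.ret = 0.0          # discounted return accumulated in this episode
        self.stats = RunningVariance()

    def normalize(self, reward, done):
        # Accumulate the discounted return and fold it into the running statistics.
        self.ret = self.gamma * self.ret + reward
        self.stats.update(self.ret)
        scaled = reward / math.sqrt(self.stats.variance + self.epsilon)
        scaled = max(-self.clip, min(self.clip, scaled))
        if done:
            self.ret = 0.0  # reset the accumulator at episode boundaries
        return scaled
```

In use, `normalize(reward, done)` would be called once per environment step, with the scaled reward passed to the learner in place of the raw one. Note that the reward is only rescaled, not mean-centered, so its sign is preserved.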

UCRL2/UCFH confidence intervals are incorrect

As per Jaksch et al. (2010), the confidence intervals for UCRL2 use t_k := the timestep at the start of episode k. However, in `run_finite_tabular_experiment` in `experiment.py`, the episode index...
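For reference, a minimal sketch of the UCRL2 confidence radii as stated in Jaksch et al. (2010), written to show where t_k enters. The function name and argument names are hypothetical and are not taken from `experiment.py`.

```python
import math


def ucrl2_confidence_radii(n_visits, t_k, n_states, n_actions, delta):
    """Return (reward_radius, transition_radius) for one state-action pair.

    n_visits: visits to (s, a) before episode k, i.e. N_k(s, a)
    t_k:      timestep at the start of episode k (not the episode index)
    delta:    confidence parameter in (0, 1)
    """
    n = max(1, n_visits)
    # Reward interval: sqrt(7 log(2 S A t_k / delta) / (2 max(1, N_k(s, a))))
    reward_radius = math.sqrt(
        7.0 * math.log(2.0 * n_states * n_actions * t_k / delta) / (2.0 * n)
    )
    # Transition interval (L1 norm): sqrt(14 S log(2 A t_k / delta) / max(1, N_k(s, a)))
    transition_radius = math.sqrt(
        14.0 * n_states * math.log(2.0 * n_actions * t_k / delta) / n
    )
    return reward_radius, transition_radius
```

Both radii grow with t_k through the logarithm, so substituting a different counter for t_k changes the width of the intervals.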