zhezherun

Results 3 issues of zhezherun

If GaussianPolicy receives a batched TimeStep, it applies the same noise to all actions returned by the wrapped policy. Instead, it should sample a different noise term per batch element....

If an episode is truncated by the time limit wrapper, the last discount in that episode is set to 1.0 instead of 0.0. As a result, both the reward and...

The following code in `PPOAgent.compute_advantages` ignores value predictions for final observations in the trajectory and instead passes one-before-last values to the `generalized_advantage_estimation` function twice: ```python # Arg value_preds was appended...