Alexander Nikulin comments

Results: 30 comments of Alexander Nikulin

Average PPO implementation

@vwxyzjn I'd appreciate it if you could take a quick look at the code (without going into details) to check that it matches the style of the rest of the library.

Average PPO implementation

I also don't quite understand the decision to evaluate an agent by episodic reward and with stochastic actions. This is especially noticeable with --capture-video as it slows down the training...
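The distinction raised in this comment can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian policy; `select_action` and its parameters are hypothetical names used for illustration only and are not taken from the library under discussion.

```python
import random


def select_action(mean, std, deterministic, rng):
    """Pick an action from a 1-D Gaussian policy (hypothetical helper).

    deterministic=True returns the policy mean, which is a common choice
    for evaluation; deterministic=False samples from the distribution,
    as is done while collecting training rollouts.
    """
    if deterministic:
        return mean
    return rng.gauss(mean, std)


rng = random.Random(0)  # fixed seed so the sampled action is reproducible
print("deterministic:", select_action(0.5, 0.2, True, rng))
print("stochastic:   ", select_action(0.5, 0.2, False, rng))
```

Evaluating with the mean action removes exploration noise from the reported score, whereas episodic returns logged during training include that noise.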

Average PPO implementation

> What is the problem and how is it related to capture-video?

Always capturing video from one of the envs during training has a noticeable overhead (3x slower on my...

Average PPO implementation

Yup, I've started running some experiments. I'll test on Swimmer-v3 first, since PPO's results on it are very different from those on the other environments (for...

Average PPO implementation

Well, the paper only lists the hyperparameter sweep they used, not the final best values. :disappointed:

Average PPO implementation

Yes, that's it! The most important parameters are given here only as a grid for a sweep. It also doesn't specify `num-envs`/`num-steps`, but I suppose we could leave them...

Average PPO implementation

@vwxyzjn So, what is the policy on submitting runs to wandb? Should I first experiment in my local private project and then re-run the final evaluation in `openrlbenchmark/cleanrl`? Or...

Average PPO implementation

First sanity check on 3 seeds: it seems to be working as expected on Swimmer-v3. The results are even better than in the paper, but they use more seeds. ![W B Chart 22 06 2022,...

It is a deep mystery to me why it works so well on this particular environment. Algorithms based on discounted reward can only [solve](https://github.com/thu-ml/tianshou/issues/401) it if you set `gamma=0.9999`, but...
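To make the `gamma=0.9999` remark concrete, here is a minimal sketch of the arithmetic involved. It uses the common rule of thumb that a discount factor `gamma` corresponds to an effective horizon of roughly `1 / (1 - gamma)` steps, and contrasts a discounted return with the differential return used in the average-reward setting. All function names are hypothetical, and this is not the implementation from the pull request.

```python
def effective_horizon(gamma):
    """Rule-of-thumb horizon of a discount factor: 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def differential_return(rewards, avg_reward):
    """Sum of (r_t - avg_reward): the average-reward analogue of a return."""
    return sum(r - avg_reward for r in rewards)


for gamma in (0.99, 0.999, 0.9999):
    print(f"gamma={gamma}: horizon of about {effective_horizon(gamma):.0f} steps")
```

With `gamma=0.99` the agent effectively looks about 100 steps ahead, while `gamma=0.9999` stretches that to about 10,000 steps. The average-reward formulation has no discount factor at all and instead measures rewards relative to their long-run average.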

Average PPO implementation

@vwxyzjn Results for APO on Gym Mujoco will be in this report: https://wandb.ai/openrlbenchmark/cleanrl/reports/-WIP-APO-on-Gym-Mujoco---VmlldzoyMjEwMjY4 Feel free to edit or suggest changes. The runtime should ideally be the same as PPO's (as there is no...