At the end, add a `trainer.evaluate()` call. That will make sure the greedy policy is followed each time. It should give 500.
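For context, a minimal sketch of where that call would go, assuming the genrl-style API (the exact import paths and constructor arguments here are my assumption and may differ across versions; only `trainer.evaluate()` is the call under discussion):

```python
# Assumed genrl-style setup -- import paths are an assumption, not verified.
from genrl import DQN
from genrl.deep.common import OffPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v1")
agent = DQN("mlp", env)
trainer = OffPolicyTrainer(agent, env)

trainer.train()     # learns with the exploratory (epsilon-greedy) policy
trainer.evaluate()  # greedy rollouts; CartPole-v1 caps the episode return at 500
```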
What `trainer.evaluate()` does is make sure that whenever an action is selected, the deterministic policy is followed (see the VPG implementation). Not sure specifically about VPG, but it's...
Yes, it should. But the stochasticity may remain the same. That way, the agent may have already learned the optimal policy but will still continue to explore in the same...
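In other words (an illustrative sketch, not the repo's actual code): evaluation forces the greedy action, while training keeps the exploration noise even after the optimal policy has been learned.

```python
import random

import torch

def select_action(q_values: torch.Tensor, epsilon: float, deterministic: bool) -> int:
    # Evaluation (`deterministic=True`) always takes the greedy action;
    # training keeps exploring with probability `epsilon`.
    if deterministic or random.random() > epsilon:
        return int(q_values.argmax())
    return random.randrange(q_values.numel())
```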
> When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP accordingly on that feature vector...
This is up for grabs!
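For anyone picking this up, a minimal sketch of the idea in the quote above, using the classic DQN shapes for a stacked (4, 84, 84) Atari observation (illustrative only, not the repo's implementation):

```python
import torch
import torch.nn as nn

class CnnMlpPolicy(nn.Module):
    """The CNN turns the frame stack into a flat feature vector; an MLP head acts on it."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),  # (N, 64, 7, 7) -> (N, 3136) feature vector
        )
        self.mlp = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.features(obs))

q_values = CnnMlpPolicy(n_actions=6)(torch.zeros(1, 4, 84, 84))  # shape (1, 6)
```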
Are you done here, @hades-rp2010? If you can resolve the merge conflicts and maybe the CodeClimate issues, then we can merge this.
Not sure. I checked a couple of files here and there and they weren't. Feel free to remove them from the list if they're already done. There's a lot of...
It's just `batch_size` right now. Don't look at the older docstrings; just look at the function's init variables.
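One quick way to check this without trusting the docs, using a hypothetical class as a stand-in for the actual one:

```python
import inspect

class ReplayBuffer:  # hypothetical stand-in; the stale docstring is deliberate
    def __init__(self, batch_size: int = 32):
        """Older docstring might still mention a different name -- ignore it."""
        self.batch_size = batch_size

# The signature, not the docstring, is the source of truth:
print(inspect.signature(ReplayBuffer.__init__))  # (self, batch_size: int = 32)
```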
Sure!
Tbh, shifting to scientific notation doesn't sound like a good idea, for the simple reason that it doesn't look good. The current logger takes care of the fixed-length problem...
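For comparison, plain fixed-width float formatting (not the repo's logger) already keeps columns aligned without switching notation:

```python
value = 1234.56789

print(f"{value:12.4e}")  # '  1.2346e+03' -- fixed width, but harder to scan
print(f"{value:12.3f}")  # '    1234.568' -- fixed width and reads naturally
```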