genrl
Usage explanatory docs
Go to the docs/source/usage/tutorials and add separate .md files to explain the following:
- [x] Using A2C (@Darshan-ko )
- [ ] Using PPO1
- [x] Using VPG (@Devanshu24 )
- [ ] Using DQN(s)
- [ ] Using DDPG
- [ ] Using TD3
- [ ] Using SAC
- [ ] Demonstrate Saving model parameters
- [ ] Demonstrate Loading pretrained models
- [x] Using Multi-Armed Bandits (@Darshan-ko )
- [x] Using Contextual Bandits (@Darshan-ko )
When working on this issue, it is important to explain the algorithms as well and not just have what's present in the readme already.
Also add an entry for the tutorial in docs/source/usage/tutorials/index.rst
EDIT: Changed docs/source to docs/source/tutorials
EDIT2: Changed docs/source/tutorials to docs/source/usage/tutorials
Hi! Could you give a brief idea as to what should be included in those files?
Things to be included:
- Example code to run the algo
- Links to the source docs of the relevant algos
- Hyperparameters/arguments you can customise. For example, show usage of rollout_size or of different architectures ("cnn" and "mlp")
Feel free to add anything you think would be helpful for a beginner to understand how to use our repo. We'll iron out the details when you put up a PR.
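Something along these lines could work as the opening snippet of each tutorial (A2C shown here; treat the exact keyword arguments like rollout_size as illustrative and check the source docs for the real signatures):

```python
from genrl import A2C
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

# "mlp" selects a feedforward policy/value network; "cnn" would be used for
# image observations such as Atari frames.
env = VectorEnv("CartPole-v1")
agent = A2C("mlp", env, rollout_size=128)  # rollout_size value is illustrative

trainer = OnPolicyTrainer(agent, env, epochs=100)
trainer.train()
trainer.evaluate()
```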
I'll work on adding the docs for using A2C if no one has taken it up already?
When I tried to run A2C on 'CartPole-v0', the policy loss and policy entropy just go to 0 after a few epochs (around 15), and thus the performance gets stuck in a local optimum with approximately the same mean reward from there onwards. Is this happening due to vanishing gradients?
Is the mean reward dropping? Also run trainer.evaluate for a few episodes to check if the final mean reward is 200.0 or not. Our logger rounds off the loss values; the actual policy_loss can go to very small values like 1e-6 etc. That's alright (same with entropy).
Yes, the mean reward is also dropping, starting from around 23.67 and stabilising at around 9.3. trainer.evaluate also shows the same.
Does it go to around 160-180 in the middle? It's a known issue that our A2C is unstable and suddenly drops in performance midway. If it's not training at all, you can raise an issue for that separately.
Oh yes, most of the time it does go to 170-180ish in the middle for 2-3 episodes and then drops sharply. What is the reason for this instability?
Yeah, so it's fine for now. Not sure why our A2C collapses all of a sudden. There isn't any problem with the logic.
I'd like to write the docs for VPG, if that's okay
For anyone working on this, please add the files to docs/source/tutorials, not docs/source
I can take up using multi armed bandits and contextual bandits.
What is the definition of timestep which is displayed on the console during training? Is it the time from the start of training (I think it looked like it), or is it the timestep from the start of an epoch?
Edit: Ok now I am almost certain it is the former, still want to confirm it to be sure
Timestep from the very beginning
```python
import gym

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

# Train VPG with an MLP policy on a vectorised CartPole-v1 env
env = VectorEnv("CartPole-v1")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, epochs=1000)
trainer.train()
```
I tried running VPG with this code. The mean_reward reaches a maximum of just 409.6 and then continuously stays there (kind of converges to 409.6). Not sure why exactly that specific number, but I ran it multiple times and it's always the same.
Is there a way to get/plot the max_rewards in each epoch, sort of a thing?
UPD: I ran it for 5000 epochs and it reaches a max mean_reward of 409.6 by ~2000 epochs, and after that it starts crashing down and goes to the 160ish range.
At the end, add a trainer.evaluate(). That will make sure that the greedy policy is followed each time. Should give 500.
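i.e. something like this at the end of the script above:

```python
trainer.train()
# Greedy (deterministic) rollouts after training; on CartPole-v1 the mean
# reward reported here should be 500.
trainer.evaluate()
```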
Okay, but from what I understand it'll use the learnt policy, so even if it reached 500 just once during the learning phase it'll be able to achieve 500 during a greedy eval.
But when I ran it for 5000 epochs, it doesn't converge to anywhere close to 500 (I may be wrong, but this shouldn't happen, right?)
Attached the log for reference: vpg-genrl.txt
What trainer.evaluate() does is make sure that whenever an action is selected, the deterministic policy is followed (see the VPG implementation). Not sure specifically about VPG, but it's likely that it doesn't always follow a deterministic policy unless it's explicitly set to do so, like in evaluate.
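Not the actual genrl source, but the idea is roughly this sketch of a hypothetical select_action for a discrete action space:

```python
import torch
from torch.distributions import Categorical

def select_action(logits, deterministic=False):
    # Illustrative only: during train() we sample from the action distribution,
    # so some stochasticity always remains; evaluate() passes deterministic=True
    # and always takes the most probable (greedy) action.
    dist = Categorical(logits=logits)
    if deterministic:
        return torch.argmax(dist.probs, dim=-1)
    return dist.sample()
```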
Oh okay, thanks! I'll look into the implementation again. My question was: even if it is following a stochastic policy, the policy should still improve over the course of trainer.train, right?
Yes, it should. But the stochasticity maybe remains the same. That way the agent may have already learned the optimal policy but will still continue to explore in the same proportion. It's weird that it gets stuck at 409.6, but it's not a problem if the greedy policy converges.
Oh okay! I'll have to read up a bit more on this to get a better understanding, thanks!
> I can take up using multi armed bandits and contextual bandits.
@Darshan-ko can do this after #176 is merged (which will mostly happen today). There are a lot of significant changes for bandits in this PR.
@threewisemonkeys-as Ok cool.
It's merged now, btw
When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
When we use CNNs as feature extractors in RL algorithms, we generally learn the CNN + MLP during training of the RL agent. So when you do loss.backward() for either the policy or the value function, this loss gets propagated all the way back to the CNN and optimises its representation specifically for that agent. So there is no need to train / pretrain it separately.
In fact, it would be pretty hard to train the CNN independently of the RL agent, since you have no labeled data in this scenario.
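To make that concrete, here's an illustrative (not genrl's actual) CNN + MLP policy where a single loss.backward() updates both parts:

```python
import torch
import torch.nn as nn

class CnnPolicy(nn.Module):
    """Toy Atari-style policy: CNN feature extractor followed by an MLP head."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )

    def forward(self, obs):
        return self.head(self.features(obs))

policy = CnnPolicy(n_actions=6)
optimizer = torch.optim.Adam(policy.parameters())  # CNN and MLP params together

obs = torch.randn(8, 4, 84, 84)        # dummy batch of stacked 84x84 frames
logits = policy(obs)
loss = -logits.log_softmax(-1).mean()  # stand-in for the real policy/value loss
loss.backward()                        # gradients flow back into the conv layers too
optimizer.step()
```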
> When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
If you've noticed, we use architecture "cnn" for CNN architectures for Atari envs. What that does is use a CNNValue from deep/common/values.py.
@threewisemonkeys-as But doesn't it seem counter-intuitive that optimizing the loss function of the policy and value functions can help in learning the CNN parameters? As in, is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
@sampreet-arthi I got that, but I could not find an explicit loss function or loss.backward() for the CNN, so I was confused about the same.
See, your policy pi is supposed to be any function with parameters mapping states to actions. For Deep RL it's usually some neural network, but it can even be a simple linear function which multiplies the state by a certain value (the parameter).
When we do policy optimisation (through policy gradients), we basically look at how the policy is performing and compute gradients in the direction of better performance. These gradients are w.r.t. the parameters of the policy, and we update these parameters accordingly to make the policy perform better.
Now in the case of Atari, the policy consists of both a CNN and an MLP, so the parameters of both need to be updated when optimising the policy. Over the course of training, the CNN will eventually learn to give the optimum features required for the policy to do well.
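For reference, the standard policy-gradient expression makes this explicit; writing theta for all the policy parameters (CNN and MLP together):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```

Every gradient step therefore moves all of theta, the convolutional filters included, in the direction of higher expected return; there is no separate objective for the CNN.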
> is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
> No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
Oh ok, I get it. So if I understood correctly, this is because there is no particular objective for the CNN other than providing features on which we can train the policy and value function and find an optimal policy, right?