genrl
Usage explanatory docs
Go to the docs/source/usage/tutorials and add separate .md files to explain the following:
- [x] Using A2C (@Darshan-ko )
- [ ] Using PPO1
- [x] Using VPG (@Devanshu24 )
- [ ] Using DQN(s)
- [ ] Using DDPG
- [ ] Using TD3
- [ ] Using SAC
- [ ] Demonstrate Saving model parameters
- [ ] Demonstrate Loading pretrained models
- [x] Using Multi-Armed Bandits (@Darshan-ko )
- [x] Using Contextual Bandits (@Darshan-ko )
When working on this issue, it is important to explain the algorithms as well and not just have what's present in the readme already.
Also add an entry for the tutorial in docs/source/usage/tutorials/index.rst
EDIT: Changed docs/source to docs/source/tutorials
EDIT2: Changed docs/source/tutorials to docs/source/usage/tutorials
Hi! Could you give a brief idea as to what should be included in those files?
Things to be included:
- Example code to run the algo
- Links to the source docs of the relevant algos
- Hyperparameters/arguments you can customise. For example, show usage of rollout_size or of different architectures ("cnn" and "mlp")
Feel free to add anything you think would be helpful for a beginner to understand how to use our repo. We'll iron out the details when you put up a PR.
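Something along these lines could work as the opening snippet of each tutorial (A2C shown here; treat the exact keyword arguments like rollout_size as illustrative and check the source docs for the real signatures):

```python
from genrl import A2C
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

# "mlp" selects a feedforward policy/value network; "cnn" would be used for
# image observations such as Atari frames.
env = VectorEnv("CartPole-v1")
agent = A2C("mlp", env, rollout_size=128)  # rollout_size value is illustrative

trainer = OnPolicyTrainer(agent, env, epochs=100)
trainer.train()
trainer.evaluate()
```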
I'll work on adding the docs for using A2C if no one has taken it up already?
When I tried to run A2C on 'CartPole-v0', the policy loss and policy entropy just go to 0 after a few epochs (around 15), and thus the performance gets stuck in a local optimum with approximately the same mean reward from there onwards. Is this happening due to vanishing gradients?
Is the mean reward dropping? Also run trainer.evaluate for a few episodes to check if the final mean reward is 200.0 or not. Our logger rounds off the loss values; the actual policy_loss can go to very small values like 1e-6 etc. That's alright (same with entropy).
Yes, the mean reward is also dropping, starting from around 23.67 and stabilising at around 9.3. trainer.evaluate also shows the same.
Does it go to around 160-180 in the middle? It's a known issue that our A2C is unstable and suddenly drops in performance midway. If it's not training at all, you can raise an issue for that separately.
Oh yes, most of the time it does go to 170-180ish in the middle for 2-3 episodes and then drops sharply. What is the reason for this instability?
Yeah, so it's fine for now. Not sure why our A2C collapses all of a sudden. There isn't any problem with the logic.
I'd like to write the docs for VPG, if that's okay
For anyone working on this, please add the files to docs/source/tutorials, not docs/source
I can take up using multi armed bandits and contextual bandits.
What is the definition of timestep which is displayed on the console during training? Is it the time from the start of training (I think it looked like it), or is it the timestep from the start of an epoch?
Edit: Ok now I am almost certain it is the former, still want to confirm it to be sure
Timestep from the very beginning
```python
import gym

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

# Train VPG with an MLP policy on a vectorised CartPole-v1 env
env = VectorEnv("CartPole-v1")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, epochs=1000)
trainer.train()
```
I tried running VPG with this code. The mean_reward reaches a maximum of just 409.6 and then continuously stays there (kind of converges to 409.6). Not sure why exactly that specific number, but I ran it multiple times and it's always the same.
Is there a way to get/plot the max_rewards in each epoch, sort of a thing?
UPD: I ran it for 5000 epochs and it reaches a max mean_reward of 409.6 by ~2000 epochs, and after that it starts crashing down and goes to the 160ish range.
At the end, add a trainer.evaluate(). That will make sure that the greedy policy is followed each time. Should give 500.
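i.e. something like this at the end of the script above:

```python
trainer.train()
# Greedy (deterministic) rollouts after training; on CartPole-v1 the mean
# reward reported here should be 500.
trainer.evaluate()
```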
Okay, but from what I understand it'll use the learnt policy, so even if it reached 500 just once during the learning phase it'll be able to achieve 500 during a greedy eval.
But when I ran it for 5000 epochs, it doesn't converge to anywhere close to 500 (I may be wrong, but this shouldn't happen, right?)
Attached the log for reference: vpg-genrl.txt
What trainer.evaluate() does is make sure that whenever an action is selected, the deterministic policy is followed (see the VPG implementation). Not sure specifically about VPG, but it's likely that it doesn't always follow a deterministic policy unless it's explicitly set to do so, like in evaluate.
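Not the actual genrl source, but the idea is roughly this sketch of a hypothetical select_action for a discrete action space:

```python
import torch
from torch.distributions import Categorical

def select_action(logits, deterministic=False):
    # Illustrative only: during train() we sample from the action distribution,
    # so some stochasticity always remains; evaluate() passes deterministic=True
    # and always takes the most probable (greedy) action.
    dist = Categorical(logits=logits)
    if deterministic:
        return torch.argmax(dist.probs, dim=-1)
    return dist.sample()
```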
Oh okay, thanks! I'll look into the implementation again. My question was: even if it is following a stochastic policy, the policy should still improve over the course of trainer.train, right?
Yes, it should. But the stochasticity maybe remains the same. That way the agent may have already learned the optimal policy but will still continue to explore in the same proportion. It's weird that it gets stuck at 409.6, but it's not a problem if the greedy policy converges.
Oh okay! I'll have to read up a bit more on this to get a better understanding, thanks!
> I can take up using multi armed bandits and contextual bandits.
@Darshan-ko can do this after #176 is merged (which will mostly happen today). There are a lot of significant changes for bandits in this PR.
@threewisemonkeys-as Ok cool.
It's merged now, btw
When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
When we use CNNs as feature extractors in RL algorithms, we generally learn the CNN + MLP during training of the RL agent. So when you do loss.backward() for either the policy or the value function, this loss gets propagated all the way back to the CNN and optimises its representation specifically for that agent. So there is no need to train / pretrain it separately.
In fact, it would be pretty hard to train the CNN independently of the RL agent, since you have no labeled data in this scenario.
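To make that concrete, here's an illustrative (not genrl's actual) CNN + MLP policy where a single loss.backward() updates both parts:

```python
import torch
import torch.nn as nn

class CnnPolicy(nn.Module):
    """Toy Atari-style policy: CNN feature extractor followed by an MLP head."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )

    def forward(self, obs):
        return self.head(self.features(obs))

policy = CnnPolicy(n_actions=6)
optimizer = torch.optim.Adam(policy.parameters())  # CNN and MLP params together

obs = torch.randn(8, 4, 84, 84)        # dummy batch of stacked 84x84 frames
logits = policy(obs)
loss = -logits.log_softmax(-1).mean()  # stand-in for the real policy/value loss
loss.backward()                        # gradients flow back into the conv layers too
optimizer.step()
```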
> When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
If you've noticed, we use architecture "cnn" for CNN architectures for Atari envs. What that does is use a CNNValue from deep/common/values.py.
@threewisemonkeys-as But doesn't it seem counter-intuitive that optimizing the loss function of the policy and value functions can help in learning the CNN parameters? As in, is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
@sampreet-arthi I got that, but I could not find an explicit loss function or loss.backward() for the CNN, so I was confused about the same.
See, your policy pi is supposed to be any function with parameters mapping states to actions. For Deep RL it's usually some neural network, but it can even be a simple linear function which multiplies the state by a certain value (the parameter).
When we do policy optimisation (through policy gradients), we basically look at how the policy is performing and compute gradients in the direction of better performance. These gradients are w.r.t. the parameters of the policy, and we update these parameters accordingly to make the policy perform better.
Now in the case of Atari, the policy consists of both a CNN and an MLP, so the parameters of both need to be updated when optimising the policy. Over the course of training, the CNN will eventually learn to give the optimum features required for the policy to do well.
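For reference, the standard policy-gradient expression makes this explicit; writing theta for all the policy parameters (CNN and MLP together):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```

Every gradient step therefore moves all of theta, the convolutional filters included, in the direction of higher expected return; there is no separate objective for the CNN.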
> is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
> No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
Oh ok, I get it. So if I understood correctly, this is because there is no particular objective for the CNN other than providing features on which we can train the policy and value function and find an optimal policy, right?