pytorch-REINFORCE
Please add some explanation
Hi,
Thank you for the sample code. I could not understand what exactly is happening here: https://github.com/JamesChuanggg/pytorch-REINFORCE/blob/master/reinforce_discrete.py#L52
If possible, can you please give a little explanation?
Thanks
It's just maximizing the objective: the loss is the negative of the return-weighted log-probabilities, so minimizing it with gradient descent maximizes the policy's expected return.
This is where the loss is being calculated. If you look at the algorithm presented in Sutton's book (page 289), it is slightly different from what is given here, which is closer to Deep RL - Policy Gradients (page 34).
Basically, instead of applying an update step after calculating each advantage * grad log pi term, we calculate all the terms and then sum them into a single loss so that we can call backward() on it once. I am not sure what the theoretical differences are between applying t updates per episode vs. 1 update per episode, but I am currently looking into it.
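To make that concrete, here is a minimal sketch of the "sum everything, then one backward()" pattern. This is a hypothetical helper written for illustration, not the code from the repo; `reinforce_loss`, its arguments, and the default `gamma` are my own naming.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Accumulate the REINFORCE loss over one whole episode.

    log_probs: list of scalar tensors log pi(a_t | s_t), one per step
    rewards:   list of floats r_t, one per step
    (hypothetical helper for illustration, not from the repo)
    """
    # Compute discounted returns G_t by sweeping backwards over rewards
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()

    # Instead of t separate updates, sum the per-step terms
    # -G_t * log pi(a_t | s_t); a single backward() then differentiates
    # the whole sum, which is the same gradient as summing the per-step
    # gradients.
    loss = 0.0
    for log_prob, G in zip(log_probs, returns):
        loss = loss - log_prob * G
    return loss
```

After building the loss this way you would do the usual `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()` once per episode.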
Also confused about the entropies in the loss function, can anyone make a little explanation ?
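As far as I understand it, the entropy terms are an exploration bonus: subtracting a small multiple of the policy's entropy from the loss penalizes overly peaked action distributions, so the policy keeps exploring longer. A sketch of what such a term looks like (the function name and the coefficient `beta` are my own; the repo may use a different value):

```python
import torch

def entropy_bonus(probs, beta=1e-4):
    """Entropy of a categorical policy distribution, scaled by beta.

    probs: 1-D tensor of action probabilities pi(. | s), summing to 1
    (hypothetical helper for illustration, not from the repo)
    """
    # H(pi) = -sum_a pi(a|s) * log pi(a|s)
    entropy = -(probs * torch.log(probs)).sum()
    return beta * entropy
```

In the episode loss you would then use something like `loss = loss - entropy_bonus(probs_t)` at each step, so that minimizing the loss also maximizes entropy, trading off a little bit of greediness for exploration.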