deep-rl-class
deep-rl-class copied to clipboard
Fix unit5 reinforce implementation
The provided implementation computes the policy loss considering only the return G_0, as: sum over all t of the G_0 *log_policy(a_t|s_t)
However, the reinforce algorithm requires to compute the returns G_t for all t, to then calculate the sum over all t of the G_t *log_policy(a_t|s_t).
Hence, the wrong computation has been fixed to consider the discounted returns at each timestep.
Note: the policy loss has still not been multiplied by the constant 1/n_step, to maintain the least number of changes to obtain the correctly working implementation (a scaling constant would be accounted for by the learning rate anyway in the optimization process)
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
Hey there 👋 sorry for the delay. Thanks for your help on reinforce integration 🤗 There's a conflict between branches so I need to check that first. Also I'm currently checking the pytorch implementation provided by the official pytorch repo before making an update on this one. I keep you updated.
Thanks for your help,
Hey there wave sorry for the delay. Thanks for your help on reinforce integration hugs There's a conflict between branches so I need to check that first. Also I'm currently checking the pytorch implementation provided by the official pytorch repo before making an update on this one. I keep you updated.
Thanks for your help,
Hello Thomas, sorry for the huge delay, but in the end i managed to get back to this😅 I created a new merge request without the conflicts at https://github.com/huggingface/deep-rl-class/pull/95 I have also checked the official pytorch REINFORCE Implementation and added the standardization of the returns as they also do. I have also noticed that this merge request provides also improvements in speed of the computation. They use the same computation of the returns as is provided in this merge request. Furthermore, it seems that the udacity implementation of REINFORCE got it wrong too https://github.com/udacity/deep-reinforcement-learning/blob/master/reinforce/REINFORCE.ipynb
Anyway, more details on this are also in the new merge request.
Thanks for taking the time to review the merge req!