
Fix unit5 reinforce implementation

Open Chris1nexus opened this issue 2 years ago • 2 comments

The provided implementation computes the policy loss using only the return G_0, i.e. the sum over all t of G_0 * log_policy(a_t|s_t).

However, the REINFORCE algorithm requires computing the return G_t for every timestep t, and then taking the sum over all t of G_t * log_policy(a_t|s_t).

Hence, the incorrect computation has been fixed to use the discounted return at each timestep.

Note: the policy loss is still not multiplied by the constant 1/n_steps, to keep the number of changes needed for a correctly working implementation to a minimum (a constant scaling factor would be absorbed by the learning rate anyway during optimization).
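To make the difference concrete, here is a minimal sketch (not the notebook's exact code) contrasting the buggy G_0-only computation with the per-timestep returns G_t = r_t + gamma * G_{t+1}; the function names, rewards, and gamma are illustrative:

```python
def returns_g0_only(rewards, gamma):
    # Buggy variant: every timestep is weighted by the episode return G_0.
    g0 = sum(gamma**t * r for t, r in enumerate(rewards))
    return [g0] * len(rewards)

def returns_per_timestep(rewards, gamma):
    # Fixed variant: G_t = r_t + gamma * G_{t+1}, computed backwards in O(n).
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

rewards = [1.0, 0.0, 2.0]
gamma = 0.9
print(returns_g0_only(rewards, gamma))       # every entry is G_0 (about 2.62)
print(returns_per_timestep(rewards, gamma))  # approximately [2.62, 1.8, 2.0]
```

The loss would then weight each log_policy(a_t|s_t) by the matching entry of the per-timestep list rather than by the single scalar G_0.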

Chris1nexus avatar Jul 24 '22 11:07 Chris1nexus


Hey there 👋 sorry for the delay. Thanks for your help on the REINFORCE integration 🤗 There's a conflict between branches, so I need to check that first. Also, I'm currently checking the PyTorch implementation provided by the official PyTorch repo before making an update on this one. I'll keep you updated.

Thanks for your help,

simoninithomas avatar Aug 24 '22 12:08 simoninithomas


Hello Thomas, sorry for the huge delay, but in the end I managed to get back to this 😅 I created a new merge request without the conflicts at https://github.com/huggingface/deep-rl-class/pull/95. I have also checked the official PyTorch REINFORCE implementation and added the standardization of the returns, as they do. I have also noticed that this merge request improves computation speed: it uses the same computation of the returns as the official implementation. Furthermore, it seems that the Udacity implementation of REINFORCE got it wrong too: https://github.com/udacity/deep-reinforcement-learning/blob/master/reinforce/REINFORCE.ipynb
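The return standardization mentioned above can be sketched as follows: subtract the mean and divide by the standard deviation (plus a small epsilon for numerical stability) before using the returns as weights. This is only an illustration of the idea; the function name and epsilon value are assumptions, not the merge request's exact code:

```python
import statistics

def standardize_returns(returns, eps=1e-8):
    # Shift to zero mean and scale to unit variance, which tends to
    # reduce the variance of the policy-gradient estimate.
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)
    return [(g - mean) / (std + eps) for g in returns]

print(standardize_returns([2.62, 1.8, 2.0]))  # zero-mean, unit-variance weights
```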

Anyway, more details on this are also in the new merge request.

Thanks for taking the time to review the merge req!

Chris1nexus avatar Oct 17 '22 02:10 Chris1nexus