ReinforcementLearning.jl
TRPO
This PR implements Trust Region Policy Optimization (TRPO) and adds a CartPole experiment for it.
To this end, I wrote a few utility functions that are shared among the policy gradient policies (#737). A better approach might be to have a PolicyGradientPolicy type that wraps different learners.
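To make the wrapper idea concrete, here is a rough sketch of what it could look like. It is only an illustration: `AbstractPGLearner`, `VanillaPGLearner`, `TRPOLearner`, `discounted_returns`, `describe`, and the field names are all hypothetical and are not taken from this PR or from the existing ReinforcementLearning.jl API. Only the name `PolicyGradientPolicy` comes from the proposal above, and discounted returns are used as a stand-in for the kind of utility that could be shared.

```julia
# Hypothetical sketch: none of these names exist in ReinforcementLearning.jl.

# Common supertype for the learners that a policy gradient policy could wrap.
abstract type AbstractPGLearner end

# Stand-in for a shared utility: discounted returns-to-go for one episode.
function discounted_returns(rewards::AbstractVector{<:Real}, γ::Real)
    G = zeros(Float64, length(rewards))
    acc = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for t in length(rewards):-1:1
        acc = rewards[t] + γ * acc
        G[t] = acc
    end
    return G
end

# Two illustrative learners; a real one would also hold networks and optimisers.
struct VanillaPGLearner <: AbstractPGLearner
    γ::Float64
end

struct TRPOLearner <: AbstractPGLearner
    γ::Float64
    max_kl::Float64  # trust-region size; stored but unused in this sketch
end

# The wrapper: one policy type, parameterised by the learner it delegates to.
struct PolicyGradientPolicy{L<:AbstractPGLearner}
    learner::L
end

# Learner-specific behaviour is selected by dispatch on the wrapped learner.
describe(::VanillaPGLearner) = "vanilla policy gradient"
describe(::TRPOLearner) = "TRPO"
describe(p::PolicyGradientPolicy) = describe(p.learner)

# Shared behaviour lives on the wrapper and is written only once.
returns(p::PolicyGradientPolicy, rewards) = discounted_returns(rewards, p.learner.γ)

policy = PolicyGradientPolicy(TRPOLearner(0.5, 0.01))
println(describe(policy), ": ", returns(policy, [1.0, 1.0, 1.0]))
```

With a layout like this, the shared helpers would be defined once on the wrapper, and each learner would only need to supply its own update rule.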
Looks fine to me in general. I think there's still room for improvement in the gradient part. I'll add more detailed comments this weekend.
I'll merge this first. I may find some time in the next week to polish this further ;)