
Ranking Policy Gradient

redknightlois opened this issue on Jul 22 '19 • 0 comments

This extension to DQN and other algorithms looks pretty interesting for smoothing out the variance of the Q-value estimates.

From the abstract:

Sample inefficiency is a long-standing problem in reinforcement learning (RL). The state of the art uses value functions to derive policies, which usually requires an extensive search over the state-action space and is one reason for the inefficiency. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal ranking of a set of discrete actions. To accelerate the learning of policy gradient methods, we describe a novel off-policy learning framework and establish the equivalence between maximizing a lower bound of the return and imitating a near-optimal policy without accessing any oracles. These results lead to a general sample-efficient off-policy learning framework, which accelerates learning and reduces variance. Furthermore, the sample complexity of RPG does not depend on the dimension of the state space, which enables RPG for large-scale problems. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces the sample complexity compared to the state of the art.

You should take a look: https://arxiv.org/abs/1906.09674

The authors' reference implementation of the RPG variants of several agents: https://github.com/illidanlab/rpg/tree/master/dopamine/dopamine/agents
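
To make the idea concrete, here is a minimal sketch of how I read the abstract, not the authors' implementation: the policy ranks the discrete actions by a learned score (softmax over per-action scores), and the off-policy framework trains it with a supervised imitation loss on state-action pairs taken from trajectories whose return clears some threshold. The linear scorer, the dimensions, and the toy "good trajectory" data below are all made up for illustration.

```python
# Hedged sketch of the RPG idea (assumptions: discrete actions, a simple
# linear scoring model standing in for a neural network).
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, lr = 4, 8, 0.1
W = rng.normal(scale=0.01, size=(n_actions, state_dim))  # action-scoring weights

def action_probs(state):
    """Ranking policy: softmax over the per-action scores."""
    scores = W @ state
    z = np.exp(scores - scores.max())
    return z / z.sum()

def imitation_update(state, action):
    """Cross-entropy update toward a near-optimal action, i.e. the
    supervised surrogate that (per the abstract) lower-bounds the return."""
    global W
    p = action_probs(state)
    # gradient of -log p[action] with respect to W
    grad = np.outer(p - np.eye(n_actions)[action], state)
    W -= lr * grad

# Toy usage: pretend the replay buffer was filtered down to trajectories
# whose return exceeds a threshold, then replay their state-action pairs.
good_pairs = [(rng.normal(size=state_dim), int(rng.integers(n_actions)))
              for _ in range(32)]
for s, a in good_pairs:
    imitation_update(s, a)
print(action_probs(good_pairs[0][0]))
```

The appeal, as I understand it, is that this turns the policy improvement step into a supervised-learning problem, which is where the claimed variance reduction and sample-complexity gains come from.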
