Self-Imitation Learning
https://arxiv.org/abs/1806.05635
Abstract
- SIL (Self-Imitation Learning) aims to verify that exploiting past good experiences can indirectly drive deep exploration
- competitive to state-of-the-art methods
1. Introduction
- Atari "Montezuma's Revenge"
- A2C ends up with a poor policy
- exploiting the experiences in which the agent picks up the key enables it to explore further
- Main contributions are,
- To study how exploiting past good experiences affects learning (SIL)
- Theoretical justification of SIL, derived from a lower bound of the optimal Q-function (see the sketch after this list)
- SIL is very simple to implement
- Generically applicable to any actor-critic architecture
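A hedged sketch of the lower-bound idea mentioned above, paraphrased in my own notation rather than quoted from the paper: the return observed under any behavior policy is, in expectation, no larger than the optimal action-value, so past returns can serve as lower-bound targets.

```latex
% Paraphrased sketch (my notation), not the paper's exact statement:
% for any behavior policy \mu that generated the experience,
Q^{*}(s_t, a_t) \;\ge\; Q^{\mu}(s_t, a_t)
  \;=\; \mathbb{E}_{\mu}\!\left[\, R_t \mid s_t, a_t \,\right],
\qquad R_t = \sum_{k \ge t} \gamma^{\,k - t} r_k .
% SIL therefore treats a stored return R_t as a lower-bound target and learns
% from it only where it exceeds the current value estimate, i.e. on (R_t - V_\theta(s_t))_+ .
```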
2. Related work
- Exploration
- notion of curiosity or uncertainty as a signal for exploration [Schmidhuber, 1991; Strehl & Littman, 2008]
- this paper differs in that it exploits what the agent has already experienced but not yet learned
- Episodic control
- Lengyel & Dayan, 2008
- an extreme way of exploiting past experiences, in the sense that it repeats the best outcome from the past
- Experience replay
- Lin, 1992
- natural way of exploiting past experiences for parametric policies
- Prioritized experience replay [Moore & Atkeson, 1993; Schaul et al., 2016]
- prioritizing past experiences based on temporal difference error
- Optimality tightening [He et al., 2017] is similar to this paper
- Experience replay for actor-critic
- actor-critic framework can also utilize experience replay
- difference between off-policy and on-policy (see Stack Overflow)
- off-policy evaluation involves importance sampling (e.g., ACER and Reactor use Retrace for evaluation), which may not benefit much from past experience if the past policy is very different from the current policy (see the sketch after this list)
- this paper does not involve importance sampling and is applicable to both discrete and continuous control
- Connection between policy gradient and Q-learning
- difference from "Combining Policy Gradient and Q-Learning" (PGQL) is that this paper proposes lower-bound Q-learning to exploit good experiences
- Learning from imperfect demonstrations
- prior works [Liang et al., 2016; Abolafia et al., 2018] used a classification loss without justification
- this paper proposes a new objective, provides a theoretical justification, and systematically investigates how it drives exploration in RL
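A minimal sketch, outside the paper, of the importance-sampling point above: with ratio-based corrections such as Retrace, the truncated per-step weights multiply along the trajectory, so experience collected under a policy that now looks very different from the current one contributes almost no learning signal. The function name and the probabilities below are illustrative assumptions.

```python
# Hedged illustration (not from the paper): truncated per-step importance weights
# in the style of Retrace, c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t)).
# Their running product scales how much an old transition contributes to the update.

def cumulative_trace(pi_probs, mu_probs, lam=1.0):
    """Running product of truncated importance weights along a trajectory."""
    product, products = 1.0, []
    for p, m in zip(pi_probs, mu_probs):
        product *= lam * min(1.0, p / m)
        products.append(product)
    return products

# The current policy assigns low probability to the actions an old policy took,
# so the trace collapses after a few steps and the old data is effectively ignored.
print(cumulative_trace(pi_probs=[0.05, 0.10, 0.05], mu_probs=[0.90, 0.80, 0.90]))
# SIL instead weights replayed transitions by the clipped advantage (R - V(s))_+,
# with no importance ratios involved.
```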
3. Self-Imitation Learning
- goal is to imitate the agent's past good experiences within the actor-critic framework
- proposes to store past episodes with their cumulative rewards in a replay buffer
- off-policy actor-critic loss (see the sketch below)
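A minimal Python sketch of the two bullets above as I read them: a buffer storing (state, action, discounted return R) for past episodes, and the SIL off-policy loss that imitates a stored action only when R exceeds the current value estimate. Names (SILBuffer, sil_loss, beta_sil), the uniform sampling, and the default hyperparameters are my own illustrative choices; the paper's full algorithm additionally uses prioritized replay and autograd-based networks.

```python
import numpy as np

class SILBuffer:
    """Replay buffer that stores past transitions with their discounted return R."""

    def __init__(self):
        self.data = []  # list of (state, action, R) tuples

    def add_episode(self, states, actions, rewards, gamma=0.99):
        # Compute the discounted cumulative return R_t for every step of the episode.
        R, returns = 0.0, []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        self.data.extend(zip(states, actions, returns))

    def sample(self, batch_size):
        # Uniform sampling for simplicity; the paper prioritizes by the clipped advantage.
        idx = np.random.choice(len(self.data), size=min(batch_size, len(self.data)), replace=False)
        return [self.data[i] for i in idx]


def sil_loss(log_pi_a, v_s, R, beta_sil=0.01):
    """SIL objective on a single stored transition: imitate only if R > V(s).

    log_pi_a : log pi_theta(a|s) for the stored action
    v_s      : current value estimate V_theta(s) (treated as a baseline in the policy term)
    R        : stored discounted return
    """
    advantage = max(R - v_s, 0.0)         # clipped advantage (R - V(s))_+
    policy_loss = -log_pi_a * advantage   # push the policy toward past good actions
    value_loss = 0.5 * advantage ** 2     # pull V(s) up toward the observed return
    return policy_loss + beta_sil * value_loss
```

In an actor-critic loop this term would typically be added alongside the usual on-policy A2C loss, with a few SIL updates per iteration sampled from the buffer.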
4. Theoretical Justification
5. Experiment
6. Conclusion
- a proper level of exploitation of past experiences during learning can drive deep exploration, and SIL and exploration methods can be complementary
- balancing exploration and exploitation in terms of collecting and learning from experiences is an important future research direction