
Self Imitation Learning


https://arxiv.org/abs/1806.05635

Abstract

  • SIL (Self-Imitation Learning) learns to imitate the agent's own past good experiences, verifying that exploiting them can indirectly drive deep exploration.
  • competitive with state-of-the-art exploration methods on hard-exploration Atari games

1. Introduction

  • Atari "Montezuma's Revenge"
    • A2C ends up with a poor policy
    • exploiting the experiences where the agent picks up the key enables it to explore further
  • Main contributions are,
    1. To study how exploiting past good experiences affects learning (SIL)
    2. Theoretical justification of SIL, derived from the lower bound of the optimal Q-function
    3. SIL is very simple to implement
    4. Generically applicable to any actor-critic architecture

2. Related work

  • Exploration
  • this paper differs from exploration methods in that it exploits what the agent has already experienced but not yet learned from
  • Episodic control
    • Lengyel & Dayan, 2008
    • an extreme way of exploiting past experiences, in the sense that it repeats the best outcome observed in the past
  • Experience replay
  • Experience replay for actor-critic
    • actor-critic framework can also utilize experience replay
    • difference between off-policy and on-policy learning (StackOverflow)
    • off-policy evaluation involves importance sampling (e.g., ACER and Reactor use Retrace), which may not benefit much from past experience if the past policy is very different from the current one (see the note after this list)
    • this paper's method does not involve importance sampling and is applicable to both discrete and continuous control
  • Connection between policy gradient and Q-learning
  • Learning from imperfect demonstrations
    • prior works [Liang et al., 2016; Abolafia et al., 2018] used a classification loss without justification
    • this paper proposes a new objective, provides a theoretical justification, and systematically investigates how it drives exploration in RL
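
A note on why importance-sampling-based off-policy corrections degrade with stale data (my summary, not from the original issue): Retrace truncates the per-step importance weight,

$$c_s = \lambda \, \min\!\left(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\right),$$

and the correction applied to a transition $k$ steps ahead is scaled by the product $c_{t+1} \cdots c_{t+k}$. When the behavior policy $\mu$ that generated the old experience is very different from the current policy $\pi$, these products shrink quickly and the trace is cut, so old transitions contribute almost nothing to the update. SIL sidesteps this by not using importance sampling at all.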

3. Self Imitation Learning

  • the goal is to imitate the agent's past good experiences in the actor-critic framework
  • proposes to store past episodes with their cumulative returns in a replay buffer (see the sketch after this list)
  • off-policy actor-critic (SIL) loss (equation images from the original issue are omitted here)
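
My reconstruction of the SIL objective from the paper (notation may differ slightly from the original equation images): the buffer stores $(s, a, R)$ with $R = \sum_{k \ge t} \gamma^{k-t} r_k$, and the loss is

$$\mathcal{L}^{sil} = \mathbb{E}_{(s,a,R) \in \mathcal{D}}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil}\,\mathcal{L}^{sil}_{value}\right],$$

$$\mathcal{L}^{sil}_{policy} = -\log \pi_\theta(a \mid s)\,(R - V_\theta(s))_+, \qquad \mathcal{L}^{sil}_{value} = \tfrac{1}{2}\big\|(R - V_\theta(s))_+\big\|^2,$$

where $(\cdot)_+ = \max(\cdot, 0)$, so only transitions whose return exceeds the current value estimate are imitated.

Below is a minimal PyTorch-style sketch of this loss, assuming a small discrete-action actor-critic network. Class and function names (`PolicyValueNet`, `SILReplayBuffer`, `sil_loss`) are my own, not from the paper's code, and the buffer uses uniform sampling instead of the paper's prioritized replay.

```python
# Minimal sketch of the SIL loss (not the authors' implementation).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyValueNet(nn.Module):
    """Tiny shared-body actor-critic for discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)  # policy logits
        self.v_head = nn.Linear(hidden, 1)           # state value V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.pi_head(h), self.v_head(h).squeeze(-1)


class SILReplayBuffer:
    """Stores (s, a, R) where R is the discounted return from step t to episode end."""
    def __init__(self, capacity=100_000, gamma=0.99):
        self.buffer = deque(maxlen=capacity)
        self.gamma = gamma

    def add_episode(self, states, actions, rewards):
        R, returns = 0.0, []
        for r in reversed(rewards):       # R_t = sum_{k>=t} gamma^(k-t) r_k
            R = r + self.gamma * R
            returns.append(R)
        returns.reverse()
        self.buffer.extend(zip(states, actions, returns))

    def sample(self, batch_size):
        # uniform sampling for simplicity; the paper uses prioritized replay
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        s, a, R = zip(*batch)
        return (torch.tensor(s, dtype=torch.float32),
                torch.tensor(a, dtype=torch.long),
                torch.tensor(R, dtype=torch.float32))


def sil_loss(net, states, actions, returns, beta_sil=0.01):
    """L_sil = -log pi(a|s) * (R - V(s))_+  +  beta_sil * 0.5 * (R - V(s))_+^2."""
    logits, values = net(states)
    log_prob = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = (returns - values).clamp(min=0)             # (R - V)_+, only "good" samples
    policy_loss = -(log_prob * advantage.detach()).mean()   # advantage treated as a constant
    value_loss = 0.5 * (advantage ** 2).mean()              # pushes V(s) up toward R when R > V
    return policy_loss + beta_sil * value_loss


# usage sketch (hypothetical dimensions)
net = PolicyValueNet(obs_dim=4, n_actions=2)
buf = SILReplayBuffer()
buf.add_episode(states=[[0.0] * 4] * 3, actions=[0, 1, 0], rewards=[0.0, 0.0, 1.0])
s, a, R = buf.sample(batch_size=3)
loss = sil_loss(net, s, a, R)
loss.backward()  # gradients for an optimizer step, added on top of the A2C loss
```

In the paper this loss is added on top of the usual A2C (or PPO) objective, and samples are drawn with prioritized replay where the priority is $(R - V_\theta(s))_+$; the uniform sampling above is a simplification.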

4. Theoretical Justification

(equation images from the original issue are omitted here; a rough sketch of the argument follows)
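
Roughly, as I remember the argument (see Section 4 of the paper for the exact statements): the return of any behavior policy is a lower bound of the optimal Q-value,

$$\mathbb{E}[R \mid s_t = s, a_t = a] = Q^{\mu}(s,a) \le Q^{*}(s,a),$$

so it is safe to regress toward observed returns only when they exceed the current estimate. "Lower-bound soft Q-learning" minimizes

$$\mathcal{L}^{lb} = \mathbb{E}_{(s,a,R)\in\mathcal{D}}\left[\tfrac{1}{2}\big((R - Q_\theta(s,a))_+\big)^2\right],$$

and with the entropy-regularized parameterization $Q_\theta(s,a) = V_\theta(s) + \alpha \log \pi_\theta(a \mid s)$ its gradient decomposes into the SIL policy and value terms (in the limit of small $\alpha$), which is the paper's justification for the SIL objective.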

5. Experiment

(result figure images from the original issue are omitted here)

6. Conclusion

  • a proper level of exploitation of past experiences during learning can drive deep exploration, and SIL and exploration methods can be complementary
  • balancing exploration and exploitation in terms of collecting and learning from experiences is an important future research direction
