Self-Imitation Learning
https://arxiv.org/abs/1806.05635
Abstract
- SIL (Self-Imitation Learning) aims to verify that exploiting past good experiences can indirectly drive deep exploration
- competitive to state-of-the-art methods
1. Introduction
- Atari "Montezuma's Revenge"
- A2C ends up with a poor policy
- exploiting the experiences in which the agent picks up the key enables it to explore further
- Main contributions are,
- To study how exploiting past good experiences affects learning (SIL)
- Theoretical justification of SIL, derived from a lower bound of the optimal Q-function (see the sketch after this list)
- SIL is very simple to implement
- Generically applicable to any actor-critic architecture
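A hedged sketch of the lower-bound idea mentioned above, paraphrased in my own notation rather than quoted from the paper: the return observed under any behavior policy is, in expectation, no larger than the optimal action-value, so past returns can serve as lower-bound targets.

```latex
% Paraphrased sketch (my notation), not the paper's exact statement:
% for any behavior policy \mu that generated the experience,
Q^{*}(s_t, a_t) \;\ge\; Q^{\mu}(s_t, a_t)
  \;=\; \mathbb{E}_{\mu}\!\left[\, R_t \mid s_t, a_t \,\right],
\qquad R_t = \sum_{k \ge t} \gamma^{\,k - t} r_k .
% SIL therefore treats a stored return R_t as a lower-bound target and learns
% from it only where it exceeds the current value estimate, i.e. on (R_t - V_\theta(s_t))_+ .
```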
2. Related work
- Exploration
- notion of curiosity or uncertainty as a signal for exploration [Schmidhuber, 1991; Strehl & Littman, 2008]
- this paper differs in that it exploits what the agent has already experienced but not yet learned
- Episodic control
- Lengyel & Dayan, 2008
- an extreme way of exploiting past experiences, in the sense that it repeats the best outcome from the past
- Experience replay
- Lin, 1992
- natural way of exploiting past experiences for parametric policies
- Prioritized experience replay [Moore & Atkeson, 1993; Schaul et al., 2016]
- prioritizing past experiences based on temporal difference error
- Optimality tightening [He et al., 2017] is similar to this paper
- Experience replay for actor-critic
- actor-critic framework can also utilize experience replay
- difference between off-policy and on-policy (see Stack Overflow)
- off-policy evaluation involves importance sampling (e.g., ACER and Reactor use Retrace for evaluation), which may not benefit much from past experience if the past policy is very different from the current policy (see the sketch after this list)
- this paper does not involve importance sampling and is applicable to both discrete and continuous control
- Connection between policy gradient and Q-learning
- difference from "Combining Policy Gradient and Q-Learning" (PGQL) is that this paper proposes lower-bound Q-learning to exploit good experiences
- Learning from imperfect demonstrations
- prior works [Liang et al., 2016; Abolafia et al., 2018] used a classification loss without justification
- this paper proposes a new objective, provides a theoretical justification, and systematically investigates how it drives exploration in RL
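A minimal sketch, outside the paper, of the importance-sampling point above: with ratio-based corrections such as Retrace, the truncated per-step weights multiply along the trajectory, so experience collected under a policy that now looks very different from the current one contributes almost no learning signal. The function name and the probabilities below are illustrative assumptions.

```python
# Hedged illustration (not from the paper): truncated per-step importance weights
# in the style of Retrace, c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t)).
# Their running product scales how much an old transition contributes to the update.

def cumulative_trace(pi_probs, mu_probs, lam=1.0):
    """Running product of truncated importance weights along a trajectory."""
    product, products = 1.0, []
    for p, m in zip(pi_probs, mu_probs):
        product *= lam * min(1.0, p / m)
        products.append(product)
    return products

# The current policy assigns low probability to the actions an old policy took,
# so the trace collapses after a few steps and the old data is effectively ignored.
print(cumulative_trace(pi_probs=[0.05, 0.10, 0.05], mu_probs=[0.90, 0.80, 0.90]))
# SIL instead weights replayed transitions by the clipped advantage (R - V(s))_+,
# with no importance ratios involved.
```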
3. Self-Imitation Learning
- goal is to imitate the agent's past good experiences within the actor-critic framework
- proposes to store past episodes with their cumulative rewards in a replay buffer
- off-policy actor-critic loss (see the sketch below)
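A minimal Python sketch of the two bullets above as I read them: a buffer storing (state, action, discounted return R) for past episodes, and the SIL off-policy loss that imitates a stored action only when R exceeds the current value estimate. Names (SILBuffer, sil_loss, beta_sil), the uniform sampling, and the default hyperparameters are my own illustrative choices; the paper's full algorithm additionally uses prioritized replay and autograd-based networks.

```python
import numpy as np

class SILBuffer:
    """Replay buffer that stores past transitions with their discounted return R."""

    def __init__(self):
        self.data = []  # list of (state, action, R) tuples

    def add_episode(self, states, actions, rewards, gamma=0.99):
        # Compute the discounted cumulative return R_t for every step of the episode.
        R, returns = 0.0, []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        self.data.extend(zip(states, actions, returns))

    def sample(self, batch_size):
        # Uniform sampling for simplicity; the paper prioritizes by the clipped advantage.
        idx = np.random.choice(len(self.data), size=min(batch_size, len(self.data)), replace=False)
        return [self.data[i] for i in idx]


def sil_loss(log_pi_a, v_s, R, beta_sil=0.01):
    """SIL objective on a single stored transition: imitate only if R > V(s).

    log_pi_a : log pi_theta(a|s) for the stored action
    v_s      : current value estimate V_theta(s) (treated as a baseline in the policy term)
    R        : stored discounted return
    """
    advantage = max(R - v_s, 0.0)         # clipped advantage (R - V(s))_+
    policy_loss = -log_pi_a * advantage   # push the policy toward past good actions
    value_loss = 0.5 * advantage ** 2     # pull V(s) up toward the observed return
    return policy_loss + beta_sil * value_loss
```

In an actor-critic loop this term would typically be added alongside the usual on-policy A2C loss, with a few SIL updates per iteration sampled from the buffer.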
4. Theoretical Justification
5. Experiment
6. Conclusion
- a proper level of exploitation of past experiences during learning can drive deep exploration, and SIL and exploration methods can be complementary
- balancing exploration and exploitation in terms of collecting and learning from experiences is an important future research direction