
Reinforcement Learning, Fast and Slow

NicolaBernini opened this issue 5 years ago · 6 comments

Overview

Paper Readthrough related to the original paper

Reinforcement Learning, Fast and Slow

  • An in-depth analysis can also be found here; it is open for collaborative updating via Google Docs comments

Index

NicolaBernini commented on Jun 12 '19

DRL

  • RL is about learning a mapping between a State / Situation Space and an Action Space
    • For small, discrete spaces, this mapping can be represented as a lookup table (LUT)
    • For large or continuous ones, it is better represented as a function (see the sketch after this list)
    • A NN is a generic tool for learning functions in a data-driven way
    • A DNN is a specific class of NN relying on the inductive bias of depth, i.e. hierarchical structure
    • DNN + RL = DRL
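
A minimal sketch (plain NumPy; all sizes and names are purely illustrative, not from the paper) contrasting the two ways of representing the state-to-action mapping: a lookup table for a small discrete state space vs. a parametrized function, here a tiny linear model standing in for a NN, for a higher-dimensional one.

```python
import numpy as np

# --- Small, discrete state space: the mapping can be stored as a lookup table (LUT) ---
n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))        # Q(s, a) stored explicitly

def act_tabular(state: int) -> int:
    """Greedy action read directly from the LUT."""
    return int(np.argmax(q_table[state]))

# --- Large / continuous state space: the mapping is represented as a function ---
# A NN would normally play this role; a linear model keeps the sketch short.
state_dim = 8
weights = np.random.randn(state_dim, n_actions) * 0.01

def act_approx(state: np.ndarray) -> int:
    """Greedy action computed from a parametrized function of the state."""
    q_values = state @ weights                   # Q(s, ·) computed, not stored
    return int(np.argmax(q_values))

print(act_tabular(3), act_approx(np.random.randn(state_dim)))
```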

Paper Key Points

The comparison between humans and current DRL algos shows a huge difference in terms of sample efficiency (how many samples are needed to achieve a certain performance): humans learn way faster than current DRL algos, so there are interesting scientific questions here:

  • Why is this the case?
  • How can it be improved?

Learning speed is an important limiting factor to overcome in order to move DRL outside the niche of games and into more realistic settings.

Current DRL Algos Learning Performance

To attain expert human-level performance on tasks such as Atari video games or chess, deep RL systems have required many orders of magnitude more training data than human experts themselves [22]. The critique is indeed applicable to the first wave of deep RL methods, reported beginning around 2013 (e.g., [25]). However, even in the short time since then, important innovations have occurred in deep RL research, which show how the sample efficiency of deep RL can be dramatically increased.

  • The main point is that the first wave of DRL algos was very slow (i.e. low sample efficiency), but a second wave of algos has introduced innovative techniques that dramatically improve this critical aspect of learning

DRL Learning

Optimization Approach

  • Optimization as a generic tool for learning
    • NN as a generic model for the functions to learn
    • Objective Function defining a surface to navigate in search of good parametrizations = minima
  • Gradient Descent as the engine or force driving the exploration (see the sketch after this list)
  • The Objective Function typically depends on some kind of feedback, usually provided via a supervision signal; in RL this feedback is called reward and is typically much sparser than in supervised learning
    • reward sparsity makes the learning process hard to scale
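
As a toy illustration of this optimization view (not taken from the paper; the quadratic objective is just a stand-in for a real loss), here is plain gradient descent navigating an objective surface toward a minimum:

```python
import numpy as np

# The objective function defines the "surface" to navigate; minima = good parametrizations.
# In supervised learning the feedback shaping this objective is dense; in RL the reward
# playing that role is typically much sparser.
def objective(theta: np.ndarray) -> float:
    return float(np.sum((theta - 2.0) ** 2))

def gradient(theta: np.ndarray) -> np.ndarray:
    return 2.0 * (theta - 2.0)

theta = np.random.randn(5)            # random initial parametrization
lr = 0.1                              # step size of the "engine"

for _ in range(100):
    theta -= lr * gradient(theta)     # gradient descent drives the exploration of the surface

print(objective(theta))               # close to 0: a minimum has been reached
```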

How to plan

  • Predict the cumulative future reward
    • the prediction depends on the reward model (the capability of predicting the reward)
  • Make a decision = perform an action
  • Observe the reward when it becomes available
  • Correct the reward model (see the sketch of this loop after the list)
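
A minimal tabular, TD(0)-style sketch of this predict / act / observe / correct loop; the chain environment and all constants are made up for illustration, not taken from the paper.

```python
import numpy as np

# Toy chain environment: states 0..4, reward only at the last state (sparse feedback).
N_STATES, GAMMA, ALPHA = 5, 0.9, 0.1
values = np.zeros(N_STATES)                 # predicted cumulative future reward per state

def step(state: int, action: int):
    """Move left (-1) or right (+1); reward 1.0 only when reaching the last state."""
    next_state = int(np.clip(state + action, 0, N_STATES - 1))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        action = rng.choice([-1, 1])                 # make a decision (here: random)
        next_state, reward = step(state, action)     # observe the reward when available
        # Correct the prediction (TD(0) update of the value estimate).
        td_target = reward + GAMMA * values[next_state]
        values[state] += ALPHA * (td_target - values[state])
        state = next_state

print(values)    # values increase toward the rewarding end of the chain
```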

State Space

  • The State Space is the input space for the policy function
  • It can be:
    • Memory-less (Markov) vs with memory
    • Low-dim vs high-dim (high dim is typically related to high-dim sensors like images, ...)
  • Some examples (see also the sketch after this list):
  • Panel (a) is low-dim and stateless: it is easy to compress the backgammon board status beyond its original pixel representation into a very low-dim encoding
  • Panel (b) is high-dim and stateless: the Space Invaders image frame contains all the relevant information (no need for history), but it is harder than in the backgammon case, though not extremely hard, to define a handcrafted method to compress it beyond its original pixel representation
  • Panel (c) is high-dim and stateful: the Maze image by itself is not sufficient to make a decision, as history is very important (hence not Markov); furthermore, it is hard to define a handcrafted method to compress the image beyond its pixel representation
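
A purely illustrative sketch of how the three cases differ simply in how the state is represented; the shapes below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# (a) Low-dim, Markov: the board can be compressed into a small handcrafted vector.
backgammon_state = np.zeros(24 + 4)          # e.g. checkers per point + bar/off counts

# (b) High-dim, Markov: a single frame already contains all relevant information,
#     but compressing it by hand is much harder than in (a).
space_invaders_state = np.zeros((84, 84, 3), dtype=np.uint8)   # one RGB frame

# (c) High-dim, non-Markov: a single frame is not enough, history matters,
#     so the state is a sequence of frames (or a learned memory).
maze_state = np.zeros((8, 84, 84, 3), dtype=np.uint8)          # last 8 frames stacked

for name, s in [("backgammon", backgammon_state),
                ("space_invaders", space_invaders_state),
                ("maze", maze_state)]:
    print(name, s.shape, s.size)             # dimensionality grows from (a) to (c)
```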

NicolaBernini commented on Jun 12 '19

Slow DRL

Source of slow learning

  • Gradient based methods
  • Inductive Bias: Generality vs Learning Speed Trade-off

Gradient based methods

  • Can be framed as Exploration vs Exploitation

    • At the beginning, when the network is in a random configuration and there is “nothing to lose” (i.e. the network has not learned anything yet), the exploration of the state space can be fast
    • When the network has already learned something, the learning process becomes slower because of the tradeoff between learning new things and preserving the acquired knowledge
  • Learning via gradient-based methods means smaller and smaller increments, in order to check that every update does not “break” what has already been learned

  • Goals

    • trying to pursue generalization
    • trying to avoid overfitting of past knowledge (what has been learned so far)
      • NOTE: past knowledge is not special in any way, it has just been learned first by chance
  • Furthermore, since there is no prior knowledge about the loss-function landscape and the state space is too big for a detailed exploration, the preference is for continuous small improvements, so as to avoid breaking anything; the learning therefore becomes greedier (exploration carries an increasing risk of loss as the NN accumulates more knowledge) and slower (a toy sketch of this effect follows)
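
A toy, illustrative-only sketch of this effect (not from the paper): once a linear model has fully learned task A, an aggressive gradient step toward a conflicting task B "breaks" A far more than a cautious small step does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
w_a = rng.standard_normal(5)                 # ground truth for task A
y_a = x @ w_a
y_b = x @ (-w_a)                             # task B conflicts with task A

w = np.linalg.lstsq(x, y_a, rcond=None)[0]   # model has fully learned task A

def loss_a(w):
    """Error on the previously acquired knowledge (task A)."""
    return float(np.mean((x @ w - y_a) ** 2))

grad_b = 2 * x.T @ (x @ w - y_b) / len(x)    # gradient pointing toward task B
for lr in (1.0, 0.01):                       # aggressive vs cautious step size
    w_new = w - lr * grad_b
    print(f"lr={lr}: loss on task A after one step toward B = {loss_a(w_new):.3f}")
```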

Inductive Bias: Generality vs Learning Speed Trade-off

  • Inductive Bias represents initial assumptions about the pattern to be learned

    • As a consequence, it restricts or shapes the hypothesis space (the space of functions being searched) so as to make the learning faster and more effective
    • At the same time, it makes the learning machine less general
  • Neural Networks are very general learning machines; as a consequence, training them takes a lot of time / computational power

  • Deep Neural Networks are a class of Neural Networks relying on a hierarchical-structure inductive bias, which makes them effective at solving computer vision tasks (see the sketch below)
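
A minimal PyTorch sketch (the architectures are illustrative, not prescribed by the paper) contrasting a generic fully connected network with a CNN whose convolutional layers encode the hierarchical / spatial inductive bias:

```python
import torch
import torch.nn as nn

# Generic learner: a fully connected net makes no assumption about input structure.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Hierarchical / spatial inductive bias: convolutions assume local, translation-
# invariant patterns composed layer by layer, which is why CNNs learn vision
# tasks faster (with fewer samples) than an equally generic MLP.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
)

x = torch.randn(1, 1, 28, 28)       # a dummy 28x28 grayscale image
print(mlp(x).shape, cnn(x).shape)   # both map the image to 10 class scores
```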

NicolaBernini commented on Jun 12 '19

Definitions

Sample Efficiency

Sample efficiency refers to the amount of data required for a learning system to attain any chosen target level of performance.

  • How much effort, measured in samples, is needed for an agent to learn something (see the sketch below)
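
A small sketch of how sample efficiency could be measured from a learning curve; the helper function and all numbers are purely illustrative assumptions.

```python
def samples_to_target(learning_curve, target, samples_per_point):
    """Return how many training samples were consumed before the score first
    reaches `target`, given a score measured every `samples_per_point` samples."""
    for i, score in enumerate(learning_curve):
        if score >= target:
            return (i + 1) * samples_per_point
    return None   # target level never attained

# Illustrative numbers only: agent A reaches the target with far fewer samples than B.
curve_a = [0.1, 0.4, 0.7, 0.9]
curve_b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(samples_to_target(curve_a, 0.9, 10_000))   # 40000  -> more sample efficient
print(samples_to_target(curve_b, 0.9, 10_000))   # 90000  -> less sample efficient
```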

Policy Learning

  • Represent the policy function in the NN framework in order to learn it
  • Use optimization to perform effective learning
  • NNs are differentiable, hence it is possible to use algorithms such as gradient descent to learn the NN params from some objective function (see the sketch after this list)
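
A minimal REINFORCE-style sketch in PyTorch, just one possible instantiation of policy learning via gradient descent; the architecture and the dummy batch of experience are assumptions for illustration, not from the paper.

```python
import torch
import torch.nn as nn

# Policy represented by a NN: state -> action probabilities.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# Dummy batch of (state, action, return) triples standing in for collected experience.
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
returns = torch.randn(8)

# REINFORCE-style objective: because the NN is differentiable, gradient descent
# on -log pi(a|s) * G adjusts the parameters toward higher-return actions.
log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
loss = -(log_probs * returns).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```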

NicolaBernini commented on Nov 09 '19

High Level Analysis


NicolaBernini commented on Dec 26 '19

Tasks Complexity


NicolaBernini commented on Dec 26 '19

Episodic Memory


NicolaBernini commented on Dec 26 '19