
Reinforcement Learning, Fast and Slow

NicolaBernini opened this issue 5 years ago · 6 comments

Overview

Paper Readthrough related to the original paper

Reinforcement Learning, Fast and Slow

  • An in-depth analysis can also be found here; it is open for collaborative updating via Google Docs comments

Index

NicolaBernini commented on Jun 12 '19

DRL

  • RL is about learning a mapping between a State / Situation Space and an Action Space
    • For small, discrete spaces, this mapping can be represented as a lookup table (LUT)
    • For large or continuous ones, it is better represented as a function (see the sketch after this list)
    • A NN is a generic tool for learning functions in a data-driven way
    • A DNN is a specific class of NN relying on the inductive bias of depth, i.e. hierarchical structure
    • DNN + RL = DRL
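
A minimal sketch (plain NumPy; all sizes and names are purely illustrative, not from the paper) contrasting the two ways of representing the state-to-action mapping: a lookup table for a small discrete state space vs. a parametrized function, here a tiny linear model standing in for a NN, for a higher-dimensional one.

```python
import numpy as np

# --- Small, discrete state space: the mapping can be stored as a lookup table (LUT) ---
n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))        # Q(s, a) stored explicitly

def act_tabular(state: int) -> int:
    """Greedy action read directly from the LUT."""
    return int(np.argmax(q_table[state]))

# --- Large / continuous state space: the mapping is represented as a function ---
# A NN would normally play this role; a linear model keeps the sketch short.
state_dim = 8
weights = np.random.randn(state_dim, n_actions) * 0.01

def act_approx(state: np.ndarray) -> int:
    """Greedy action computed from a parametrized function of the state."""
    q_values = state @ weights                   # Q(s, ·) computed, not stored
    return int(np.argmax(q_values))

print(act_tabular(3), act_approx(np.random.randn(state_dim)))
```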

Paper Key Points

The comparison between humans and current DRL algos shows a huge difference in terms of sample efficiency (how many samples are needed to achieve a certain performance): humans learn way faster than current DRL algos, so there are interesting scientific questions here:

  • Why is this the case?
  • How can it be improved?

Learning speed is an important limiting factor to overcome in order to move DRL outside the niche of games and into more realistic settings.

Current DRL Algos Learning Performance

To attain expert human-level performance on tasks such as Atari video games or chess, deep RL systems have required many orders of magnitude more training data than human experts themselves [22]. The critique is indeed applicable to the first wave of deep RL methods, reported beginning around 2013 (e.g., [25]). However, even in the short time since then, important innovations have occurred in deep RL research, which show how the sample efficiency of deep RL can be dramatically increased.

  • The main point is that the first wave of DRL algos was very slow (i.e. low sample efficiency), but a second wave of algos has introduced innovative techniques that dramatically improve this critical aspect of learning

DRL Learning

Optimization Approach

  • Optimization as a generic tool for learning
    • NN as a generic model for the functions to learn
    • Objective Function defining a surface to navigate in search of good parametrizations = minima
  • Gradient Descent as the engine or force driving the exploration (see the sketch after this list)
  • The Objective Function typically depends on some kind of feedback, usually provided via a supervision signal; in RL this feedback is called reward and is typically much sparser than in supervised learning
    • reward sparsity makes the learning process hard to scale
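
As a toy illustration of this optimization view (not taken from the paper; the quadratic objective is just a stand-in for a real loss), here is plain gradient descent navigating an objective surface toward a minimum:

```python
import numpy as np

# The objective function defines the "surface" to navigate; minima = good parametrizations.
# In supervised learning the feedback shaping this objective is dense; in RL the reward
# playing that role is typically much sparser.
def objective(theta: np.ndarray) -> float:
    return float(np.sum((theta - 2.0) ** 2))

def gradient(theta: np.ndarray) -> np.ndarray:
    return 2.0 * (theta - 2.0)

theta = np.random.randn(5)            # random initial parametrization
lr = 0.1                              # step size of the "engine"

for _ in range(100):
    theta -= lr * gradient(theta)     # gradient descent drives the exploration of the surface

print(objective(theta))               # close to 0: a minimum has been reached
```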

How to plan

  • Predict the cumulative future reward
    • the prediction depends on the reward model (the capability of predicting the reward)
  • Make a decision = perform an action
  • Observe the reward when it becomes available
  • Correct the reward model (see the sketch of this loop after the list)
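
A minimal tabular, TD(0)-style sketch of this predict / act / observe / correct loop; the chain environment and all constants are made up for illustration, not taken from the paper.

```python
import numpy as np

# Toy chain environment: states 0..4, reward only at the last state (sparse feedback).
N_STATES, GAMMA, ALPHA = 5, 0.9, 0.1
values = np.zeros(N_STATES)                 # predicted cumulative future reward per state

def step(state: int, action: int):
    """Move left (-1) or right (+1); reward 1.0 only when reaching the last state."""
    next_state = int(np.clip(state + action, 0, N_STATES - 1))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        action = rng.choice([-1, 1])                 # make a decision (here: random)
        next_state, reward = step(state, action)     # observe the reward when available
        # Correct the prediction (TD(0) update of the value estimate).
        td_target = reward + GAMMA * values[next_state]
        values[state] += ALPHA * (td_target - values[state])
        state = next_state

print(values)    # values increase toward the rewarding end of the chain
```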

State Space

  • The State Space is the input space for the policy function
  • It can be:
    • Memory-less (Markov) vs with memory
    • Low-dim vs high-dim (high dim is typically related to high-dim sensors like images, ...)
  • Some examples (see also the sketch after this list):
  • Panel (a) is low-dim and stateless: it is easy to compress the backgammon board status beyond its original pixel representation into a very low-dim encoding
  • Panel (b) is high-dim and stateless: the Space Invaders image frame contains all the relevant information (no need for history), but it is harder than in the backgammon case, though not extremely hard, to define a handcrafted method to compress it beyond its original pixel representation
  • Panel (c) is high-dim and stateful: the Maze image by itself is not sufficient to make a decision, as history is very important (hence not Markov); furthermore, it is hard to define a handcrafted method to compress the image beyond its pixel representation
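
A purely illustrative sketch of how the three cases differ simply in how the state is represented; the shapes below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# (a) Low-dim, Markov: the board can be compressed into a small handcrafted vector.
backgammon_state = np.zeros(24 + 4)          # e.g. checkers per point + bar/off counts

# (b) High-dim, Markov: a single frame already contains all relevant information,
#     but compressing it by hand is much harder than in (a).
space_invaders_state = np.zeros((84, 84, 3), dtype=np.uint8)   # one RGB frame

# (c) High-dim, non-Markov: a single frame is not enough, history matters,
#     so the state is a sequence of frames (or a learned memory).
maze_state = np.zeros((8, 84, 84, 3), dtype=np.uint8)          # last 8 frames stacked

for name, s in [("backgammon", backgammon_state),
                ("space_invaders", space_invaders_state),
                ("maze", maze_state)]:
    print(name, s.shape, s.size)             # dimensionality grows from (a) to (c)
```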

NicolaBernini commented on Jun 12 '19

Slow DRL

Source of slow learning

  • Gradient based methods
  • Inductive Bias: Generality vs Learning Speed Trade-off

Gradient based methods

  • Can be framed as Exploration vs Exploitation

    • At the beginning, when the network is in a random configuration and there is “nothing to lose” (i.e. the network has not learned anything yet), the exploration of the state space can be fast
    • When the network has already learned something, the learning process becomes slower because of the tradeoff between learning new things and preserving the acquired knowledge
  • Learning via gradient-based methods means smaller and smaller increments, in order to check that every update does not “break” what has already been learned

  • Goals

    • trying to pursue generalization
    • trying to avoid overfitting of past knowledge (what has been learned so far)
      • NOTE: past knowledge is not special in any way, it has just been learned first by chance
  • Furthermore, since there is no prior knowledge about the loss-function landscape and the state space is too big for a detailed exploration, the preference is for continuous small improvements, so as to avoid breaking anything; the learning therefore becomes greedier (exploration carries an increasing risk of loss as the NN accumulates more knowledge) and slower (a toy sketch of this effect follows)
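
A toy, illustrative-only sketch of this effect (not from the paper): once a linear model has fully learned task A, an aggressive gradient step toward a conflicting task B "breaks" A far more than a cautious small step does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
w_a = rng.standard_normal(5)                 # ground truth for task A
y_a = x @ w_a
y_b = x @ (-w_a)                             # task B conflicts with task A

w = np.linalg.lstsq(x, y_a, rcond=None)[0]   # model has fully learned task A

def loss_a(w):
    """Error on the previously acquired knowledge (task A)."""
    return float(np.mean((x @ w - y_a) ** 2))

grad_b = 2 * x.T @ (x @ w - y_b) / len(x)    # gradient pointing toward task B
for lr in (1.0, 0.01):                       # aggressive vs cautious step size
    w_new = w - lr * grad_b
    print(f"lr={lr}: loss on task A after one step toward B = {loss_a(w_new):.3f}")
```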

Inductive Bias: Generality vs Learning Speed Trade-off

  • Inductive Bias represents initial assumptions about the pattern to be learned

    • As a consequence, it restricts or shapes the hypothesis space (the space of functions being searched) so as to make the learning faster and more effective
    • At the same time, it makes the learning machine less general
  • Neural Networks are very general learning machines; as a consequence, training them takes a lot of time / computational power

  • Deep Neural Networks are a class of Neural Networks relying on a hierarchical-structure inductive bias, which makes them effective at solving computer vision tasks (see the sketch below)
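
A minimal PyTorch sketch (the architectures are illustrative, not prescribed by the paper) contrasting a generic fully connected network with a CNN whose convolutional layers encode the hierarchical / spatial inductive bias:

```python
import torch
import torch.nn as nn

# Generic learner: a fully connected net makes no assumption about input structure.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Hierarchical / spatial inductive bias: convolutions assume local, translation-
# invariant patterns composed layer by layer, which is why CNNs learn vision
# tasks faster (with fewer samples) than an equally generic MLP.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
)

x = torch.randn(1, 1, 28, 28)       # a dummy 28x28 grayscale image
print(mlp(x).shape, cnn(x).shape)   # both map the image to 10 class scores
```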

NicolaBernini commented on Jun 12 '19

Definitions

Sample Efficiency

Sample efficiency refers to the amount of data required for a learning system to attain any chosen target level of performance.

  • How much effort, measured in samples, is needed for an agent to learn something (see the sketch below)
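
A small sketch of how sample efficiency could be measured from a learning curve; the helper function and all numbers are purely illustrative assumptions.

```python
def samples_to_target(learning_curve, target, samples_per_point):
    """Return how many training samples were consumed before the score first
    reaches `target`, given a score measured every `samples_per_point` samples."""
    for i, score in enumerate(learning_curve):
        if score >= target:
            return (i + 1) * samples_per_point
    return None   # target level never attained

# Illustrative numbers only: agent A reaches the target with far fewer samples than B.
curve_a = [0.1, 0.4, 0.7, 0.9]
curve_b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(samples_to_target(curve_a, 0.9, 10_000))   # 40000  -> more sample efficient
print(samples_to_target(curve_b, 0.9, 10_000))   # 90000  -> less sample efficient
```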

Policy Learning

  • Represent the policy function in the NN framework in order to learn it
  • Use optimization to perform effective learning
  • NNs are differentiable, hence it is possible to use algorithms such as gradient descent to learn the NN params from some objective function (see the sketch after this list)
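
A minimal REINFORCE-style sketch in PyTorch, just one possible instantiation of policy learning via gradient descent; the architecture and the dummy batch of experience are assumptions for illustration, not from the paper.

```python
import torch
import torch.nn as nn

# Policy represented by a NN: state -> action probabilities.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

# Dummy batch of (state, action, return) triples standing in for collected experience.
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
returns = torch.randn(8)

# REINFORCE-style objective: because the NN is differentiable, gradient descent
# on -log pi(a|s) * G adjusts the parameters toward higher-return actions.
log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
loss = -(log_probs * returns).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```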

NicolaBernini commented on Nov 09 '19

High Level Analysis


NicolaBernini commented on Dec 26 '19

Tasks Complexity


NicolaBernini commented on Dec 26 '19

Episodic Memory


NicolaBernini commented on Dec 26 '19