Reinforcement Learning: An Introduction

Kotlin implementation of the algorithms, examples, and exercises from Sutton and Barto's Reinforcement Learning: An Introduction (2nd Edition). The purpose of this project is to help you understand the RL algorithms and experiment with them easily.

Inspired by ShangtongZhang/reinforcement-learning-an-introduction (Python) and idsc-frazzoli/subare (Java 8).

Features:

  • Algorithms and problems are separated, so you can experiment with various combinations of <algorithm, problem> or <algorithm, function approximator, problem> (see the sketch after this list).
  • The implementations stay close to the pseudocode in the book, so reading the source code helps you understand the original algorithms.
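
To illustrate the separation, here is a minimal sketch of the pattern in Kotlin. The names (`TabularProblem`, `solve`) are hypothetical and not the repository's actual API; see the source and test cases for the real one.

```kotlin
// Sketch of the <algorithm, problem> separation (hypothetical names, not the repo's API):
// the problem exposes its dynamics through a small interface, and algorithms see only that interface.
interface TabularProblem {
    val numStates: Int
    val numActions: Int
    val startState: Int
    fun isTerminal(s: Int): Boolean
    fun step(s: Int, a: Int): Pair<Int, Double>   // returns (next state, reward)
}

// Any tabular control algorithm can be written against TabularProblem, so swapping the problem
// (grid world, cliff walking, blackjack, ...) requires no change to the algorithm code.
fun solve(problem: TabularProblem, algorithm: (TabularProblem) -> Array<DoubleArray>): Array<DoubleArray> =
    algorithm(problem)
```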

Implemented algorithms:

Model-based (Dynamic Programming):

  • Policy Iteration (Action-Value Iteration) (p.65)
  • Value Iteration (p.67)
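
As a reference point for this section, here is a minimal, self-contained sketch of Value Iteration (p.67). Representing the dynamics p(s', r | s, a) as a list of (probability, next state, reward) outcomes is an assumption for illustration, not the repository's data layout.

```kotlin
// Minimal Value Iteration sketch: repeatedly back up V(s) with the best one-step lookahead.
fun valueIteration(
    numStates: Int,
    numActions: Int,
    dynamics: (s: Int, a: Int) -> List<Triple<Double, Int, Double>>,  // (p, s', r) outcomes
    gamma: Double = 0.9,
    theta: Double = 1e-6                          // accuracy threshold for stopping
): DoubleArray {
    val v = DoubleArray(numStates)
    while (true) {
        var delta = 0.0
        for (s in 0 until numStates) {
            // V(s) <- max_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]
            val best = (0 until numActions).maxOf { a ->
                dynamics(s, a).sumOf { (p, next, r) -> p * (r + gamma * v[next]) }
            }
            delta = maxOf(delta, kotlin.math.abs(best - v[s]))
            v[s] = best
        }
        if (delta < theta) return v
    }
}
```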

Monte Carlo (episode backup):

  • First-visit MC prediction (p.76)
  • Monte Carlo Exploring Starts (p.81)
  • On-Policy first-visit MC control (p.83)
  • Off-policy MC prediction (p.90)
  • Off-policy MC control (p.91)
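
The core of this section is estimating V(s) by averaging complete-episode returns. Below is a minimal sketch of first-visit MC prediction (p.76); the episode encoding as (state, following reward) pairs is an assumption for illustration.

```kotlin
// First-visit MC prediction sketch: average the return observed after the first visit to each state.
fun firstVisitMcPrediction(
    episodes: List<List<Pair<Int, Double>>>,      // each step t: (S_t, R_{t+1})
    numStates: Int,
    gamma: Double = 1.0
): DoubleArray {
    val v = DoubleArray(numStates)
    val visits = IntArray(numStates)
    for (episode in episodes) {
        // compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        val returns = DoubleArray(episode.size)
        var g = 0.0
        for (t in episode.indices.reversed()) {
            g = episode[t].second + gamma * g
            returns[t] = g
        }
        val seen = mutableSetOf<Int>()
        for (t in episode.indices) {
            val s = episode[t].first
            if (seen.add(s)) {                     // only the first visit in this episode counts
                visits[s]++
                v[s] += (returns[t] - v[s]) / visits[s]   // incremental average of returns
            }
        }
    }
    return v
}
```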

Temporal Difference (one-step backup):

  • Tabular TD(0) (p.98)
  • Sarsa (p.106)
  • Q-learning (p.107)
  • Expected Sarsa (p.109)
  • Double Q-Learning (p.111)
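
All of these share the same one-step backup structure. As an example, here is a minimal sketch of a tabular Q-learning episode (p.107), assuming generic `step`/`isTerminal` environment hooks rather than the repository's API.

```kotlin
import kotlin.random.Random

// Q-learning sketch: behave epsilon-greedily, but back up toward the greedy successor value.
fun qLearningEpisode(
    q: Array<DoubleArray>,                          // Q(s, a) table, updated in place
    step: (s: Int, a: Int) -> Pair<Int, Double>,    // environment: returns (next state, reward)
    isTerminal: (s: Int) -> Boolean,
    start: Int,
    alpha: Double = 0.1, gamma: Double = 1.0, epsilon: Double = 0.1
) {
    var s = start
    while (!isTerminal(s)) {
        // epsilon-greedy action selection from Q
        val a = if (Random.nextDouble() < epsilon) Random.nextInt(q[s].size)
                else q[s].indices.maxByOrNull { q[s][it] }!!
        val (next, reward) = step(s, a)
        // one-step backup: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
        val bootstrap = if (isTerminal(next)) 0.0 else gamma * (q[next].maxOrNull() ?: 0.0)
        q[s][a] += alpha * (reward + bootstrap - q[s][a])
        s = next
    }
}
```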

n-step Temporal Difference (unify MC and TD):

  • n-step TD prediction (p.117)
  • n-step Sarsa (p.120)
  • Off-policy n-step Sarsa (p.122)
  • n-step Tree Backup (p.125)
  • Off-policy n-step Q(σ) (p.128)
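
What these methods have in common is the n-step return, which interpolates between one-step TD (n = 1) and Monte Carlo (n spanning the whole episode). A minimal sketch of computing it, with illustrative inputs:

```kotlin
// n-step return sketch: G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})
fun nStepReturn(
    rewards: List<Double>,        // R_{t+1} ... R_{t+n} (fewer if the episode ended earlier)
    vBootstrap: Double,           // V(S_{t+n}); ignored if the episode terminated within n steps
    gamma: Double,
    terminatedWithinN: Boolean
): Double {
    var g = 0.0
    var discount = 1.0
    for (r in rewards) {
        g += discount * r
        discount *= gamma
    }
    if (!terminatedWithinN) g += discount * vBootstrap   // bootstrap from the value estimate
    return g
}
```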

Dyna (Integrate Planning, Acting, and Learning):

  • Random-sample one-step tabular Q-planning (p.133)
  • Tabular Dyna-Q (p.135)
  • Tabular Dyna-Q+ (p.138)
  • Prioritized Sweeping (p.140)
  • Prioritized Sweeping Stochastic Environment (p.141)
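
The distinctive part of Dyna is the planning loop: after each real step, a learned model replays simulated transitions using the same backup as direct RL. A minimal sketch, assuming a deterministic model stored as a map (an illustrative choice, not the repository's representation):

```kotlin
import kotlin.random.Random

// Dyna-Q planning sketch: replay n simulated (s, a, s', r) transitions from the learned model.
fun dynaQPlanning(
    q: Array<DoubleArray>,                                  // Q(s, a) table, updated in place
    model: Map<Pair<Int, Int>, Pair<Int, Double>>,          // (s, a) -> (s', r), from real experience
    n: Int, alpha: Double, gamma: Double
) {
    val observed = model.keys.toList()
    if (observed.isEmpty()) return
    repeat(n) {
        val (s, a) = observed[Random.nextInt(observed.size)] // previously observed state-action pair
        val (next, reward) = model.getValue(s to a)
        // same one-step Q-learning backup as the direct-RL step
        val target = reward + gamma * (q[next].maxOrNull() ?: 0.0)
        q[s][a] += alpha * (target - q[s][a])
    }
}
```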

On-policy Prediction with Function Approximation:

  • Gradient Monte Carlo algorithm (p.165)
  • Semi-gradient TD(0) (p.166)
  • n-step semi-gradient TD (p.171)
  • Least-Squares TD (p.186)
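
With function approximation the backup updates a weight vector instead of a table entry. A minimal sketch of one semi-gradient TD(0) update (p.166) for a linear value function, where the feature vectors are assumed inputs:

```kotlin
// Semi-gradient TD(0) sketch with linear approximation: v(s, w) = w . x(s), so grad v = x(s).
fun semiGradientTd0Step(
    w: DoubleArray,          // weight vector, updated in place
    x: DoubleArray,          // feature vector x(S)
    xNext: DoubleArray,      // feature vector x(S')
    reward: Double,
    alpha: Double, gamma: Double,
    terminal: Boolean
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    val vNext = if (terminal) 0.0 else dot(w, xNext)
    val delta = reward + gamma * vNext - dot(w, x)       // TD error
    for (i in w.indices) w[i] += alpha * delta * x[i]    // w <- w + alpha * delta * grad v(S, w)
}
```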

On-policy Control with Function Approximation:

  • Episodic semi-gradient Sarsa (p.198)
  • Episodic semi-gradient n-step Sarsa (p.200)
  • Differential semi-gradient Sarsa (p.203)
  • Differential semi-gradient n-step Sarsa (p.206)

Off-policy Methods with Approximation:

  • Semi-gradient off-policy TD(0) (p.210)
  • Semi-gradient Expected Sarsa (p.210)
  • n-step semi-gradient off-policy Sarsa (p.211)
  • n-step semi-gradient off-policy Q(σ) (p.211)

Eligibility Traces:

  • Off-line λ-return (p.238)
  • Semi-gradient TD(λ) prediction (p.240)
  • True Online TD(λ) prediction (p.246)
  • Sarsa(λ) (p.250)
  • True Online Sarsa(λ) (p.252)
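
Eligibility traces spread each TD error backward over recently visited features. A minimal sketch of one Sarsa(λ) update with linear function approximation and an accumulating trace (p.250); the per-state-action feature vectors are assumed inputs:

```kotlin
// Sarsa(lambda) sketch: z <- gamma*lambda*z + x(S,A); delta = R + gamma*q(S',A') - q(S,A); w <- w + alpha*delta*z
fun sarsaLambdaStep(
    w: DoubleArray, z: DoubleArray,       // weights and eligibility trace, updated in place
    x: DoubleArray, xNext: DoubleArray,   // features of (S, A) and (S', A')
    reward: Double,
    alpha: Double, gamma: Double, lambda: Double,
    terminal: Boolean
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    for (i in z.indices) z[i] = gamma * lambda * z[i] + x[i]   // accumulate the trace
    val qNext = if (terminal) 0.0 else dot(w, xNext)
    val delta = reward + gamma * qNext - dot(w, x)             // TD error
    for (i in w.indices) w[i] += alpha * delta * z[i]          // credit all traced features
}
```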

Policy Gradient Methods:

  • REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic) (p.271)
  • REINFORCE with Baseline (episodic) (p.273)
  • One-step Actor-Critic (episodic) (p.274)
  • Actor-Critic with Eligibility Traces (episodic) (p.275)
  • Actor-Critic with Eligibility Traces (continuing) (p.277)
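
These methods adjust the policy parameters directly along an estimate of the policy gradient. A minimal sketch of one REINFORCE update (p.271) for a linear softmax policy; the per-action feature vectors x(s, a) are an assumed encoding, not the repository's representation:

```kotlin
import kotlin.math.exp

// REINFORCE sketch: theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t | S_t, theta),
// where for a softmax policy grad ln pi(a|s) = x(s,a) - sum_b pi(b|s) * x(s,b).
fun reinforceUpdate(
    theta: DoubleArray,            // policy parameters, updated in place
    features: List<DoubleArray>,   // x(s, a) for every action a available in S_t
    action: Int,                   // A_t
    g: Double,                     // return G_t
    gammaPowT: Double,             // gamma^t
    alpha: Double
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    val prefs = features.map { dot(theta, it) }
    val maxPref = prefs.maxOrNull() ?: 0.0
    val expPrefs = prefs.map { exp(it - maxPref) }        // numerically stable softmax
    val total = expPrefs.sum()
    val pi = expPrefs.map { it / total }
    for (i in theta.indices) {
        val expected = features.indices.sumOf { b -> pi[b] * features[b][i] }
        theta[i] += alpha * gammaPowT * g * (features[action][i] - expected)
    }
}
```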

Implemented problems:

  • Grid world (p.61)
  • Jack's Car Rental and exercise 4.4 (p.65)
  • Gambler's Problem (p.68)
  • Blackjack (p.76)
  • Random Walk (p.102)
  • Windy Gridworld and King's Moves (p.106)
  • Cliff Walking (p.108)
  • Maximization Bias Example (p.110)
  • 19-state Random Walk (p.118)
  • Dyna Maze (p.136)
  • Rod Maneuvering (p.141)
  • 1000-state Random Walk (p.166)
  • Mountain Car (p.198)
  • Access-Control Queuing Task (p.204)

Build

Built with Maven
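
Assuming a standard Maven setup, `mvn test` should compile the project and run the test cases that reproduce the figures below.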

Test cases

Run the test cases to reproduce the figures below.

Figure 7.2

Figure 7.2: Performance of n-step TD methods as a function of α, for various values of n, on a 19-state random walk task


Figure 10.1

Figure 10.1: The Mountain Car task and the cost-to-go function learned during one run


Figure 10.4

Figure 10.4: Effect of α and n on early performance of n-step semi-gradient Sarsa with tile-coding function approximation on the Mountain Car task


Figure 12.3

Figure 12.3: 19-state Random walk results: Performance of the offline λ-return algorithm.


Figure 12.6

Figure 12.6: 19-state Random walk results: Performance of TD(λ).


Figure 12.8

Figure 12.8: 19-state Random walk results: Performance of online λ-return algorithms


Figure 12.10

Figure 12.10: Early performance on the Mountain Car task of Sarsa(λ) with replacing traces


Figure 12.11

Figure 12.11: Summary comparison of Sarsa(λ) algorithms on the Mountain Car task.