Reinforcement Learning: An Introduction

Kotlin implementation of the algorithms, examples, and exercises from Sutton and Barto's Reinforcement Learning: An Introduction (2nd Edition). The purpose of this project is to help you understand the RL algorithms and experiment with them easily.

Inspired by ShangtongZhang/reinforcement-learning-an-introduction (Python) and idsc-frazzoli/subare (Java 8).

Features:

  • Algorithms and problems are separated, so you can experiment with various combinations of <algorithm, problem> or <algorithm, function approximator, problem> (see the sketch after this list).
  • The implementations stay close to the pseudocode in the book, so reading the source code helps you understand the original algorithms.
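
To illustrate the separation, here is a minimal sketch of the pattern in Kotlin. The names (`TabularProblem`, `solve`) are hypothetical and not the repository's actual API; see the source and test cases for the real one.

```kotlin
// Sketch of the <algorithm, problem> separation (hypothetical names, not the repo's API):
// the problem exposes its dynamics through a small interface, and algorithms see only that interface.
interface TabularProblem {
    val numStates: Int
    val numActions: Int
    val startState: Int
    fun isTerminal(s: Int): Boolean
    fun step(s: Int, a: Int): Pair<Int, Double>   // returns (next state, reward)
}

// Any tabular control algorithm can be written against TabularProblem, so swapping the problem
// (grid world, cliff walking, blackjack, ...) requires no change to the algorithm code.
fun solve(problem: TabularProblem, algorithm: (TabularProblem) -> Array<DoubleArray>): Array<DoubleArray> =
    algorithm(problem)
```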

Implemented algorithms:

Model-based (Dynamic Programming):

  • Policy Iteration (Action-Value Iteration) (p.65)
  • Value Iteration (p.67)
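
As a reference point for this section, here is a minimal, self-contained sketch of Value Iteration (p.67). Representing the dynamics p(s', r | s, a) as a list of (probability, next state, reward) outcomes is an assumption for illustration, not the repository's data layout.

```kotlin
// Minimal Value Iteration sketch: repeatedly back up V(s) with the best one-step lookahead.
fun valueIteration(
    numStates: Int,
    numActions: Int,
    dynamics: (s: Int, a: Int) -> List<Triple<Double, Int, Double>>,  // (p, s', r) outcomes
    gamma: Double = 0.9,
    theta: Double = 1e-6                          // accuracy threshold for stopping
): DoubleArray {
    val v = DoubleArray(numStates)
    while (true) {
        var delta = 0.0
        for (s in 0 until numStates) {
            // V(s) <- max_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]
            val best = (0 until numActions).maxOf { a ->
                dynamics(s, a).sumOf { (p, next, r) -> p * (r + gamma * v[next]) }
            }
            delta = maxOf(delta, kotlin.math.abs(best - v[s]))
            v[s] = best
        }
        if (delta < theta) return v
    }
}
```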

Monte Carlo (episode backup):

  • First-visit MC prediction (p.76)
  • Monte Carlo Exploring Starts (p.81)
  • On-Policy first-visit MC control (p.83)
  • Off-policy MC prediction (p.90)
  • Off-policy MC control (p.91)
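
The core of this section is estimating V(s) by averaging complete-episode returns. Below is a minimal sketch of first-visit MC prediction (p.76); the episode encoding as (state, following reward) pairs is an assumption for illustration.

```kotlin
// First-visit MC prediction sketch: average the return observed after the first visit to each state.
fun firstVisitMcPrediction(
    episodes: List<List<Pair<Int, Double>>>,      // each step t: (S_t, R_{t+1})
    numStates: Int,
    gamma: Double = 1.0
): DoubleArray {
    val v = DoubleArray(numStates)
    val visits = IntArray(numStates)
    for (episode in episodes) {
        // compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        val returns = DoubleArray(episode.size)
        var g = 0.0
        for (t in episode.indices.reversed()) {
            g = episode[t].second + gamma * g
            returns[t] = g
        }
        val seen = mutableSetOf<Int>()
        for (t in episode.indices) {
            val s = episode[t].first
            if (seen.add(s)) {                     // only the first visit in this episode counts
                visits[s]++
                v[s] += (returns[t] - v[s]) / visits[s]   // incremental average of returns
            }
        }
    }
    return v
}
```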

Temporal Difference (one-step backup):

  • Tabular TD(0) (p.98)
  • Sarsa (p.106)
  • Q-learning (p.107)
  • Expected Sarsa (p.109)
  • Double Q-Learning (p.111)
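
All of these share the same one-step backup structure. As an example, here is a minimal sketch of a tabular Q-learning episode (p.107), assuming generic `step`/`isTerminal` environment hooks rather than the repository's API.

```kotlin
import kotlin.random.Random

// Q-learning sketch: behave epsilon-greedily, but back up toward the greedy successor value.
fun qLearningEpisode(
    q: Array<DoubleArray>,                          // Q(s, a) table, updated in place
    step: (s: Int, a: Int) -> Pair<Int, Double>,    // environment: returns (next state, reward)
    isTerminal: (s: Int) -> Boolean,
    start: Int,
    alpha: Double = 0.1, gamma: Double = 1.0, epsilon: Double = 0.1
) {
    var s = start
    while (!isTerminal(s)) {
        // epsilon-greedy action selection from Q
        val a = if (Random.nextDouble() < epsilon) Random.nextInt(q[s].size)
                else q[s].indices.maxByOrNull { q[s][it] }!!
        val (next, reward) = step(s, a)
        // one-step backup: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
        val bootstrap = if (isTerminal(next)) 0.0 else gamma * (q[next].maxOrNull() ?: 0.0)
        q[s][a] += alpha * (reward + bootstrap - q[s][a])
        s = next
    }
}
```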

n-step Temporal Difference (unify MC and TD):

  • n-step TD prediction (p.117)
  • n-step Sarsa (p.120)
  • Off-policy n-step Sarsa (p.122)
  • n-step Tree Backup (p.125)
  • Off-policy n-step Q(σ) (p.128)
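
What these methods have in common is the n-step return, which interpolates between one-step TD (n = 1) and Monte Carlo (n spanning the whole episode). A minimal sketch of computing it, with illustrative inputs:

```kotlin
// n-step return sketch: G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})
fun nStepReturn(
    rewards: List<Double>,        // R_{t+1} ... R_{t+n} (fewer if the episode ended earlier)
    vBootstrap: Double,           // V(S_{t+n}); ignored if the episode terminated within n steps
    gamma: Double,
    terminatedWithinN: Boolean
): Double {
    var g = 0.0
    var discount = 1.0
    for (r in rewards) {
        g += discount * r
        discount *= gamma
    }
    if (!terminatedWithinN) g += discount * vBootstrap   // bootstrap from the value estimate
    return g
}
```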

Dyna (Integrate Planning, Acting, and Learning):

  • Random-sample one-step tabular Q-planning (p.133)
  • Tabular Dyna-Q (p.135)
  • Tabular Dyna-Q+ (p.138)
  • Prioritized Sweeping (p.140)
  • Prioritized Sweeping Stochastic Environment (p.141)
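
The distinctive part of Dyna is the planning loop: after each real step, a learned model replays simulated transitions using the same backup as direct RL. A minimal sketch, assuming a deterministic model stored as a map (an illustrative choice, not the repository's representation):

```kotlin
import kotlin.random.Random

// Dyna-Q planning sketch: replay n simulated (s, a, s', r) transitions from the learned model.
fun dynaQPlanning(
    q: Array<DoubleArray>,                                  // Q(s, a) table, updated in place
    model: Map<Pair<Int, Int>, Pair<Int, Double>>,          // (s, a) -> (s', r), from real experience
    n: Int, alpha: Double, gamma: Double
) {
    val observed = model.keys.toList()
    if (observed.isEmpty()) return
    repeat(n) {
        val (s, a) = observed[Random.nextInt(observed.size)] // previously observed state-action pair
        val (next, reward) = model.getValue(s to a)
        // same one-step Q-learning backup as the direct-RL step
        val target = reward + gamma * (q[next].maxOrNull() ?: 0.0)
        q[s][a] += alpha * (target - q[s][a])
    }
}
```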

On-policy Prediction with Function Approximation:

  • Gradient Monte Carlo algorithm (p.165)
  • Semi-gradient TD(0) (p.166)
  • n-step semi-gradient TD (p.171)
  • Least-Squares TD (p.186)
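
With function approximation the backup updates a weight vector instead of a table entry. A minimal sketch of one semi-gradient TD(0) update (p.166) for a linear value function, where the feature vectors are assumed inputs:

```kotlin
// Semi-gradient TD(0) sketch with linear approximation: v(s, w) = w . x(s), so grad v = x(s).
fun semiGradientTd0Step(
    w: DoubleArray,          // weight vector, updated in place
    x: DoubleArray,          // feature vector x(S)
    xNext: DoubleArray,      // feature vector x(S')
    reward: Double,
    alpha: Double, gamma: Double,
    terminal: Boolean
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    val vNext = if (terminal) 0.0 else dot(w, xNext)
    val delta = reward + gamma * vNext - dot(w, x)       // TD error
    for (i in w.indices) w[i] += alpha * delta * x[i]    // w <- w + alpha * delta * grad v(S, w)
}
```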

On-policy Control with Function Approximation:

  • Episodic semi-gradient Sarsa (p.198)
  • Episodic semi-gradient n-step Sarsa (p.200)
  • Differential semi-gradient Sarsa (p.203)
  • Differential semi-gradient n-step Sarsa (p.206)

Off-policy Methods with Approximation:

  • Semi-gradient off-policy TD(0) (p.210)
  • Semi-gradient Expected Sarsa (p.210)
  • n-step semi-gradient off-policy Sarsa (p.211)
  • n-step semi-gradient off-policy Q(σ) (p.211)

Eligibility Traces:

  • Off-line λ-return (p.238)
  • Semi-gradient TD(λ) prediction (p.240)
  • True Online TD(λ) prediction (p.246)
  • Sarsa(λ) (p.250)
  • True Online Sarsa(λ) (p.252)
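
Eligibility traces spread each TD error backward over recently visited features. A minimal sketch of one Sarsa(λ) update with linear function approximation and an accumulating trace (p.250); the per-state-action feature vectors are assumed inputs:

```kotlin
// Sarsa(lambda) sketch: z <- gamma*lambda*z + x(S,A); delta = R + gamma*q(S',A') - q(S,A); w <- w + alpha*delta*z
fun sarsaLambdaStep(
    w: DoubleArray, z: DoubleArray,       // weights and eligibility trace, updated in place
    x: DoubleArray, xNext: DoubleArray,   // features of (S, A) and (S', A')
    reward: Double,
    alpha: Double, gamma: Double, lambda: Double,
    terminal: Boolean
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    for (i in z.indices) z[i] = gamma * lambda * z[i] + x[i]   // accumulate the trace
    val qNext = if (terminal) 0.0 else dot(w, xNext)
    val delta = reward + gamma * qNext - dot(w, x)             // TD error
    for (i in w.indices) w[i] += alpha * delta * z[i]          // credit all traced features
}
```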

Policy Gradient Methods:

  • REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic) (p.271)
  • REINFORCE with Baseline (episodic) (p.273)
  • One-step Actor-Critic (episodic) (p.274)
  • Actor-Critic with Eligibility Traces (episodic) (p.275)
  • Actor-Critic with Eligibility Traces (continuing) (p.277)
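
These methods adjust the policy parameters directly along an estimate of the policy gradient. A minimal sketch of one REINFORCE update (p.271) for a linear softmax policy; the per-action feature vectors x(s, a) are an assumed encoding, not the repository's representation:

```kotlin
import kotlin.math.exp

// REINFORCE sketch: theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t | S_t, theta),
// where for a softmax policy grad ln pi(a|s) = x(s,a) - sum_b pi(b|s) * x(s,b).
fun reinforceUpdate(
    theta: DoubleArray,            // policy parameters, updated in place
    features: List<DoubleArray>,   // x(s, a) for every action a available in S_t
    action: Int,                   // A_t
    g: Double,                     // return G_t
    gammaPowT: Double,             // gamma^t
    alpha: Double
) {
    fun dot(a: DoubleArray, b: DoubleArray) = a.indices.sumOf { a[it] * b[it] }
    val prefs = features.map { dot(theta, it) }
    val maxPref = prefs.maxOrNull() ?: 0.0
    val expPrefs = prefs.map { exp(it - maxPref) }        // numerically stable softmax
    val total = expPrefs.sum()
    val pi = expPrefs.map { it / total }
    for (i in theta.indices) {
        val expected = features.indices.sumOf { b -> pi[b] * features[b][i] }
        theta[i] += alpha * gammaPowT * g * (features[action][i] - expected)
    }
}
```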

Implemented problems:

  • Grid world (p.61)
  • Jack's Car Rental and exercise 4.4 (p.65)
  • Gambler's Problem (p.68)
  • Blackjack (p.76)
  • Random Walk (p.102)
  • Windy Gridworld and King's Moves (p.106)
  • Cliff Walking (p.108)
  • Maximization Bias Example (p.110)
  • 19-state Random Walk (p.118)
  • Dyna Maze (p.136)
  • Rod Maneuvering (p.141)
  • 1000-state Random Walk (p.166)
  • Mountain Car (p.198)
  • Access-Control Queuing Task (p.204)

Build

Built with Maven
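
Assuming a standard Maven setup, `mvn test` should compile the project and run the test cases that reproduce the figures below.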

Test cases

Run the test cases to reproduce the figures below.

Figure 7.2

Figure 7.2: Performance of n-step TD methods as a function of α, for various values of n, on a 19-state random walk task


Figure 10.1

Figure 10.1: The Mountain Car task and the cost-to-go function learned during one run


Figure 10.4

Figure 10.4: Effect of α and n on early performance of n-step semi-gradient Sarsa with tile-coding function approximation on the Mountain Car task


Figure 12.3

Figure 12.3: 19-state Random walk results: Performance of the offline λ-return algorithm.


Figure 12.6

Figure 12.6: 19-state Random walk results: Performance of TD(λ).


Figure 12.8

Figure 12.8: 19-state Random walk results: Performance of online λ-return algorithms


Figure 12.10

Figure 12.10: Early performance on the Mountain Car task of Sarsa(λ) with replacing traces


Figure 12.11

Figure 12.11: Summary comparison of Sarsa(λ) algorithms on the Mountain Car task.