Reinforcement Learning
Reinforcing Your Learning of Reinforcement Learning
These are some notes and code from my study of reinforcement learning. I created this GitHub project mainly to learn from and exchange ideas with others, and to make it easier for people to find reinforcement learning material. My main motivation for studying reinforcement learning is to apply the AlphaZero approach (Monte Carlo tree search combined with deep learning) to RNA structure prediction; I have already made some attempts, such as searching for folding paths of RNA secondary structures.
The first book I read was Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto.
While reading, I also wrote some simple code based on articles found online, listed below.
Table of contents
- Q-Learning
  - Frozen Lake Game
  - Tic Tac Toe
  - Taxi v2
- Deep Q-Learning Network (DQN)
  - Doom Game
  - Atari Space Invaders
- Dueling Double DQN & Prioritized Experience Replay
  - Doom Deadly Corridor
- Policy Gradients (PG)
  - CartPole Game
  - Doom Deathmatch
  - Advantage Actor Critic (A2C)
  - Asynchronous Advantage Actor Critic (A3C)
- Proximal Policy Optimization (PPO)
  - Half Cheetah
- Deep Deterministic Policy Gradient (DDPG)
  - Ant
  - AlphaGoZero Introduction
- Monte Carlo Tree Search (MCTS)
  - Gomoku
  - AlphaGomoku
  - RNA Folding Path
  - Atari Game Roms
Q-Learning
Bellman equation (the Q-Learning update rule):

    Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
Frozen Lake Game

Playing the Frozen Lake game with Q-Learning: [code]
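Below is a minimal sketch of tabular Q-Learning on Frozen Lake, not the code from this repo. It assumes the classic gym API (FrozenLake-v0, with env.step returning four values); the hyperparameters are illustrative only.

```python
import gym
import numpy as np

env = gym.make("FrozenLake-v0")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.8, 0.95, 0.1   # learning rate, discount, exploration rate

for episode in range(10000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done, _ = env.step(action)
        # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
```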
Tic Tac Toe

Playing Tic Tac Toe with Q-Learning: [code]
Training results:
Q-Learning Player vs Q-Learning Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.45383
Q-Learning win rate: 0.3527
players draw rate: 0.19347
====================
Q-Learning Player vs Random Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.874
Random win rate: 0.03072
players draw rate: 0.09528
====================
Taxi v2

Playing the Taxi v2 game with Q-Learning: [code]
[0]. Diving deeper into Reinforcement Learning with Q-Learning
[1]. Q* Learning with FrozenLake - Notebook
[2]. Q* Learning with OpenAI Taxi-v2 - Notebook
Deep Q-Learning Network

Weight update:

    ΔW = α [r + γ max_a' Q̂(s', a') − Q̂(s, a)] ∇_W Q̂(s, a)
Doom Game

The game environment here is ViZDoom, and the neural network is a three-layer convolutional network. [code]
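As a rough illustration, a three-convolutional-layer Q-network can be written in tf.keras as below. This is a sketch only; the input shape, filter counts, and activations are assumptions, not the exact architecture used in the repo.

```python
import tensorflow as tf

def build_q_network(n_actions, input_shape=(84, 84, 4)):
    """Stacked game frames in, one Q-value per action out."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="elu", input_shape=input_shape),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="elu"),
        tf.keras.layers.Conv2D(128, 4, strides=2, activation="elu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="elu"),
        tf.keras.layers.Dense(n_actions),  # linear output: estimated Q(s, a)
    ])
```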
After roughly 1,200 training episodes, the results are as follows:
Episode 0 Score: 61.0
Episode 1 Score: 68.0
Episode 2 Score: 51.0
Episode 3 Score: 62.0
Episode 4 Score: 56.0
Episode 5 Score: 33.0
Episode 6 Score: 86.0
Episode 7 Score: 57.0
Episode 8 Score: 88.0
Episode 9 Score: 61.0
[*] Average Score: 62.3
Atari Space Invaders

The game environment is Gym Retro, and the neural network is shown in the figure below. [code]
After roughly 25 training episodes, the results are as follows:
[*] Episode: 11, total reward: 120.0, explore p: 0.7587, train loss: 0.0127
[*] Episode: 12, total reward: 80.0, explore p: 0.7495, train loss: 0.0194
[*] Episode: 13, total reward: 110.0, explore p: 0.7409, train loss: 0.0037
[*] Episode: 14, total reward: 410.0, explore p: 0.7233, train loss: 0.0004
[*] Episode: 15, total reward: 240.0, explore p: 0.7019, train loss: 0.0223
[*] Episode: 16, total reward: 230.0, explore p: 0.6813, train loss: 0.0535
[*] Episode: 17, total reward: 315.0, explore p: 0.6606, train loss: 9.7144
[*] Episode: 18, total reward: 140.0, explore p: 0.6455, train loss: 0.0022
[*] Episode: 19, total reward: 310.0, explore p: 0.6266, train loss: 1.5386
[*] Episode: 20, total reward: 200.0, explore p: 0.6114, train loss: 1.5545
[*] Episode: 21, total reward: 65.0, explore p: 0.6044, train loss: 0.0042
[*] Episode: 22, total reward: 210.0, explore p: 0.5895, train loss: 0.0161
[*] Episode: 23, total reward: 155.0, explore p: 0.5778, train loss: 0.0006
[*] Episode: 24, total reward: 105.0, explore p: 0.5665, train loss: 0.0016
[*] Episode: 25, total reward: 425.0, explore p: 0.5505, train loss: 0.0063
[0]. An introduction to Deep Q-Learning: let’s play Doom
[1]. Deep Q learning with Doom - Notebook
[2]. Deep Q Learning with Atari Space Invaders
[3]. Atari 2600 VCS ROM Collection
Dueling Double DQN and Prioritized Experience Replay
Four improvements in Deep Q Learning:
- Fixed Q-targets
- Double DQN (sketched below)
- Dueling DQN
- Prioritized Experience Replay
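As a quick illustration of the Double DQN idea from the list above (array names here are hypothetical, not code from this repo): the online network selects the next action and the target network evaluates it, which reduces the over-estimation of Q-values.

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)                        # selection (online net)
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]  # evaluation (target net)
    return rewards + gamma * evaluated * (1.0 - dones)                     # no bootstrap on terminal states
```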

Doom Deadly Corridor

The Dueling DQN network is shown in the figure below: [code]
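A minimal sketch of the dueling aggregation in tf.keras (layer sizes and names are illustrative, not the repo's architecture): the shared features are split into a state value V(s) and per-action advantages A(s, a), which are then recombined into Q-values.

```python
import tensorflow as tf

class DuelingHead(tf.keras.layers.Layer):
    """Splits shared features into a state value and advantages, then recombines them."""
    def __init__(self, n_actions):
        super().__init__()
        self.value = tf.keras.layers.Dense(1)
        self.advantage = tf.keras.layers.Dense(n_actions)

    def call(self, features):
        v = self.value(features)      # shape (batch, 1)
        a = self.advantage(features)  # shape (batch, n_actions)
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + (a - tf.reduce_mean(a, axis=1, keepdims=True))
```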
Prioritized Experience Replay uses a SumTree:
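Below is a sketch of a plain SumTree, the structure commonly used to sample transitions in proportion to their priorities (a simplified version, not the repo's implementation).

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities; every parent stores the sum of its children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)        # internal nodes + leaves
        self.data = np.zeros(capacity, dtype=object)  # stored transitions
        self.write = 0                                # index of the next leaf to overwrite

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                              # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, s):
        """Find the leaf whose cumulative priority range contains s (0 <= s <= total)."""
        idx = 0
        while True:
            left, right = 2 * idx + 1, 2 * idx + 2
            if left >= len(self.tree):                # reached a leaf
                break
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]                           # sum of all priorities
```

To sample a minibatch of size k, the total priority mass is split into k equal segments and `get()` is called with one uniform random value drawn from each segment.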
[0]. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets
[1]. Let’s make a DQN: Double Learning and Prioritized Experience Replay
[2]. Double Dueling Deep Q Learning with Prioritized Experience Replay - Notebook
Policy Gradients

CartPole Game

The Policy Gradient network is shown in the figure below.
After roughly 950 training episodes, the results are as follows:
====================
Episode: 941
Reward: 39712.0
Mean Reward: 2246.384288747346
Max reward so far: 111837.0
====================
Episode: 942
Reward: 9417.0
Mean Reward: 2253.9883351007425
Max reward so far: 111837.0
====================
Episode: 943
Reward: 109958.0
Mean Reward: 2368.08156779661
Max reward so far: 111837.0
====================
Episode: 944
Reward: 73285.0
Mean Reward: 2443.125925925926
Max reward so far: 111837.0
====================
Episode: 945
Reward: 40370.0
Mean Reward: 2483.217758985201
Max reward so far: 111837.0
[*] Model Saved: ./model/model.ckpt
See the full code here: [tensorflow] [pytorch]
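As a rough sketch of the REINFORCE update (function and tensor names are mine, not the repo's): compute the discounted return for every step of an episode, then minimize the negative log-probability of the chosen actions weighted by those returns.

```python
import numpy as np
import torch

def discounted_returns(rewards, gamma=0.99):
    """Discounted return G_t for each step of one episode, normalized for stability."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)

def reinforce_loss(log_probs, returns):
    """log_probs: torch tensor of log pi(a_t | s_t) for the taken actions; returns: G_t."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(log_probs * returns).mean()
```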
Doom Deathmatch

The neural network is as above; see the full code here: [code]
[0]. An introduction to Policy Gradients with Cartpole and Doom
[1]. Cartpole: REINFORCE Monte Carlo Policy Gradients - Notebook
[2]. Doom-Deathmatch: REINFORCE Monte Carlo Policy gradients - Notebook
[3]. Deep Reinforcement Learning: Pong from Pixels
Advantage Actor Critic
[to be done]
Asynchronous Advantage Actor Critic
[to be done]
Proximal Policy Optimization
Half Cheetah

After 500 training epochs:
----------------------------------------
Epoch: 499
TotalEnvInteracts: 2000000
EpRet: 585.4069 470.6009(min) 644.4069(max) 67.6205(std)
EpLen: 1000.0000
VVals: 46.3796 23.6165(min) 50.2677(max) 2.5903(std)
LossPi: -0.0172
LossV: 506.2033
DeltaLossPi: -0.0172
DeltaLossV: -32.9680
Entropy: 5.5010
KL: 0.0188
ClipFrac: 0.1937
StopIter: 6.0000
Time: 27175.5427s
----------------------------------------
See the full code here: [code]
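The heart of PPO is the clipped surrogate objective. A minimal sketch (tensor names are assumptions, not Spinning Up's API):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_ratio=0.2):
    """PPO-Clip surrogate objective, negated so that minimizing it maximizes the objective."""
    ratio = torch.exp(log_probs_new - log_probs_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(ratio * advantages, clipped).mean()
```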
[0]. OpenAI Spinning Up - Proximal Policy Optimization
Deep Deterministic Policy Gradient
Ant

Training was unstable; the maximum average reward was reached at epoch 225:
----------------------------------------
Epoch: 225
TotalEnvInteracts: 1130000
EpRet: 1358.9755 920.2474(min) 1608.4678(max) 238.5494(std)
EpLen: 992.0000
TestEpRet: 1101.4177 479.9980(min) 1520.0907(max) 415.7242(std)
TestEpLen: 873.2000
QVals: 136.3016 -112.4969(min) 572.5870(max) 36.0409(std)
LossPi: -138.1635
LossQ: 7.0895
Time: 28183.4149s
----------------------------------------
Over time the average reward fluctuated considerably, rising and falling; after 365 training epochs:
----------------------------------------
Epoch: 365
TotalEnvInteracts: 1830000
EpRet: -1250.8838 -2355.5800(min) -10.4664(max) 810.3868(std)
EpLen: 723.5714
TestEpRet: -1241.4192 -2211.0383(min) -884.2655(max) 342.6774(std)
TestEpLen: 1000.0000
QVals: 407.9140 -116.6802(min) 684.3555(max) 76.7627(std)
LossPi: -413.1655
LossQ: 61.5379
Time: 50710.5035s
----------------------------------------
See the full code here: [code]
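A minimal sketch of the DDPG critic target, assuming hypothetical `target_actor` and `target_critic` callables (not Spinning Up's API): the target actor proposes the next action, and the target critic bootstraps its Q-value.

```python
import torch

def ddpg_critic_target(rewards, dones, next_states, target_actor, target_critic, gamma=0.99):
    """Bootstrapped critic target: r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    with torch.no_grad():                        # targets are treated as constants
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions)
    return rewards + gamma * (1 - dones) * next_q
```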
[0]. OpenAI Spinning Up - Deep Deterministic Policy Gradient
AlphaGoZero Introduction
After reading the AlphaGo Zero paper and some related articles online, I combined the material, as I understood it, into this slide deck. I used it at a group meeting to give my classmates and advisor a short introduction to the methods and principles behind AlphaGo Zero, and to think about how they could be applied to other fields. It covers more than AlphaGo Zero: it also includes another paper I read recently, in which the research team used a similar approach to solve the Rubik's Cube. [pdf]

[0]. AlphaGo Zero - How and Why it Works
[1]. Alpha Go Zero Cheat Sheet
[2]. Mastering the game of Go with deep neural networks and tree search
[3]. Mastering the game of Go without Human Knowledge
Monte Carlo Tree Search
Gomoku

MCTS vs. a Random Player: [code]. Another MCTS implementation for Tic Tac Toe: [code].
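In the selection phase, MCTS typically picks the child that maximizes the UCB1 score. A minimal sketch (the exploration constant c is a tunable parameter):

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.4):
    """UCT score used during the selection phase of MCTS."""
    if visits == 0:
        return float("inf")                  # always try unvisited children first
    exploitation = total_value / visits      # average reward of this child
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```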
[0]. mcts.ai
[1]. Introduction to Monte Carlo Tree Search
AlphaGomoku
A Gomoku (five-in-a-row) AI implemented with the AlphaGo Zero approach.
The image below shows a game against a human player after 3,000 self-play training games; the AI has become quite hard to beat.

Two policy-value network models are provided:
################
# Residual_CNN #
################
Network Diagram:
|-----------------------| /---C---B---R---F---D---R---D---T [value head]
I---C---B---R---o---C---B---R---C---B---M---R--- ..... ---|
\_______/ \_______________________/ \---C---B---R---F---D---S [policy head]
[Conv layer] [Residual layer]
I - input
B - BatchNormalization
R - Rectifier non-linearity, LeakyReLU
T - tanh
C - Conv2D
F - Flatten
D - Dense
M - merge, add
S - Softmax
O - output
##############
# Simple_CNN #
##############
Network Diagram:
2(1x1) 64 1
32(3x3) 64(3x3) 128(3x3) /-----C-----F-----D-----D-----T [value head]
I-----C-----R-----C-----R-----C-----R-----|
\_____________________________/ \-----C-----F-----D-----S [policy head]
[Convolutional layer] 4(1x1) w^2
I - input
B - BatchNormalization
R - ReLU
T - tanh
C - Conv2D
F - Flatten
D - Dense
S - Softmax
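A simplified Keras sketch of such a two-headed policy-value network (the number of input planes, filter counts, and layer sizes are assumptions; see the diagrams above for the actual layouts):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_policy_value_net(board_size=8, n_filters=64):
    """A simplified two-headed network: a tanh value head and a softmax policy head."""
    board = layers.Input(shape=(board_size, board_size, 4))   # stacked board planes (assumed)
    x = board
    for _ in range(3):                                        # shared convolutional trunk
        x = layers.Conv2D(n_filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    # value head: a single scalar in [-1, 1] predicting the game outcome
    v = layers.Conv2D(2, 1)(x)
    v = layers.Flatten()(v)
    v = layers.Dense(64, activation="relu")(v)
    value = layers.Dense(1, activation="tanh", name="value_head")(v)

    # policy head: one move probability per board intersection
    p = layers.Conv2D(4, 1)(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(board_size * board_size, activation="softmax", name="policy_head")(p)

    model = tf.keras.Model(board, [value, policy])
    model.compile(optimizer="adam",
                  loss={"value_head": "mean_squared_error",
                        "policy_head": "categorical_crossentropy"})
    return model
```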
The results after 3,000 self-play training games on an 8x8 board are as follows:

[*] Episode: 2991, length: 42, start: O, winner: X, data: 336, time: 85s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.7491 - value_head_loss: 0.4658 - policy_head_loss: 1.0655
[*] Episode: 2992, length: 19, start: O, winner: O, data: 152, time: 40s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6507 - value_head_loss: 0.4631 - policy_head_loss: 0.9698
[*] Episode: 2993, length: 23, start: X, winner: X, data: 184, time: 47s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6409 - value_head_loss: 0.4322 - policy_head_loss: 0.9908
[*] Episode: 2994, length: 35, start: X, winner: X, data: 280, time: 71s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6128 - value_head_loss: 0.4528 - policy_head_loss: 0.9421
[*] Episode: 2995, length: 16, start: X, winner: O, data: 128, time: 35s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.7529 - value_head_loss: 0.4884 - policy_head_loss: 1.0466
[*] Episode: 2996, length: 22, start: O, winner: X, data: 176, time: 46s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6800 - value_head_loss: 0.4583 - policy_head_loss: 1.0038
[*] Episode: 2997, length: 16, start: X, winner: O, data: 128, time: 35s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6877 - value_head_loss: 0.4973 - policy_head_loss: 0.9725
[*] Episode: 2998, length: 22, start: X, winner: O, data: 176, time: 48s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6530 - value_head_loss: 0.4887 - policy_head_loss: 0.9464
[*] Episode: 2999, length: 16, start: X, winner: O, data: 128, time: 33s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6951 - value_head_loss: 0.4582 - policy_head_loss: 1.0189
[*] Episode: 3000, length: 9, start: X, winner: X, data: 72, time: 18s, win ratio: X 48.1%, O 48.5%, - 3.4%
Epoch 1/1
512/512 [==============================] - 1s 2ms/step - loss: 1.6760 - value_head_loss: 0.4743 - policy_head_loss: 0.9838
See the full code and the trained model weights here: [code]
[0]. How to build your own AlphaZero AI using Python and Keras
[1]. Github: AppliedDataSciencePartners/DeepReinforcementLearning
[2]. Github: Rochester-NRT/RocAlphaGo
[3]. 28 天自制你的 AlphaGo (6) : 蒙特卡洛树搜索(MCTS)基础
[4]. AlphaZero实战:从零学下五子棋(附代码)
[5]. Github: junxiaosong/AlphaZero_Gomoku
RNA Folding Path
Using deep reinforcement learning to learn folding paths of RNA secondary structure. I won't repeat the details here; please see: [link]
Atari Game Roms
Here are some Atari game ROMs that can be imported into the Gym Retro environment to make playing these games easy. [link]