pg_travel
Policy Gradient algorithms (REINFORCE, NPG, TRPO, PPO)
Policy Gradient (PG) Algorithms

This repository contains PyTorch (v0.4.0) implementations of typical policy gradient (PG) algorithms.
- Vanilla Policy Gradient [1]
- Truncated Natural Policy Gradient [4]
- Trust Region Policy Optimization [5]
- Proximal Policy Optimization [7]
We have implemented and trained the agents with the PG algorithms using the following benchmarks. Trained agents and Unity ml-agent environment source files will soon be available in our repo!
- mujoco-py: https://github.com/openai/mujoco-py
- Unity ml-agent: https://github.com/Unity-Technologies/ml-agents
For reference, in-depth reviews (in Korean) of the PG papers listed below are available at https://reinforcement-learning-kr.github.io/2018/06/29/0_pg-travel-guide/. Enjoy!
- [1] R. Sutton, et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation", NIPS 2000.
- [2] D. Silver, et al., "Deterministic Policy Gradient Algorithms", ICML 2014.
- [3] T. Lillicrap, et al., "Continuous Control with Deep Reinforcement Learning", ICLR 2016.
- [4] S. Kakade, "A Natural Policy Gradient", NIPS 2002.
- [5] J. Schulman, et al., "Trust Region Policy Optimization", ICML 2015.
- [6] J. Schulman, et al., "High-Dimensional Continuous Control using Generalized Advantage Estimation", ICLR 2016.
- [7] J. Schulman, et al., "Proximal Policy Optimization Algorithms", arXiv, https://arxiv.org/pdf/1707.06347.pdf.
Table of Contents
- Policy Gradient (PG) Algorithms
- Mujoco-py
  - 1. Installation
  - 2. Train
    - Basic Usage
    - Continue training from the saved checkpoint
    - Test the pretrained model
    - Modify the hyperparameters
  - 3. Tensorboard
  - 4. Trained Agent
- Unity ml-agents
  - 1. Installation
  - 2. Environments
  - 3. Train
    - Basic Usage
    - Continue training from the saved checkpoint
    - Test the pretrained model
    - Modify the hyperparameters
  - 4. Tensorboard
  - 5. Trained Agent
- Reference
Mujoco-py
1. Installation
2. Train
Navigate to pg_travel/mujoco folder
Basic Usage
Train the agent with PPO using Hopper-v2 without rendering.
python main.py
- Note that models are saved in the `save_model` folder automatically every 100th iteration.
Train the agent with TRPO using HalfCheetah-v2 with rendering.
python main.py --algorithm TRPO --env HalfCheetah-v2 --render
- algorithm: PG, TNPG, TRPO, PPO (default)
- env: Ant-v2, HalfCheetah-v2, Hopper-v2 (default), Humanoid-v2, HumanoidStandup-v2, InvertedPendulum-v2, Reacher-v2, Swimmer-v2, Walker2d-v2
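The options above could be wired up roughly as follows. This is only an illustrative sketch built on Python's standard `argparse`; the function name `build_parser` is hypothetical, and the repository's actual `main.py` may define its arguments differently.

```python
import argparse

# Values copied from the option lists above.
ALGORITHMS = ["PG", "TNPG", "TRPO", "PPO"]
ENVS = ["Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Humanoid-v2",
        "HumanoidStandup-v2", "InvertedPendulum-v2", "Reacher-v2",
        "Swimmer-v2", "Walker2d-v2"]


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the repo's code)."""
    parser = argparse.ArgumentParser(description="PG training options (illustrative)")
    parser.add_argument("--algorithm", choices=ALGORITHMS, default="PPO")
    parser.add_argument("--env", choices=ENVS, default="Hopper-v2")
    parser.add_argument("--render", action="store_true")
    return parser


# Equivalent to: python main.py --algorithm TRPO --env HalfCheetah-v2 --render
args = build_parser().parse_args(
    ["--algorithm", "TRPO", "--env", "HalfCheetah-v2", "--render"]
)
print(args.algorithm, args.env, args.render)
```

Restricting values with `choices` makes a mistyped algorithm or environment name fail at startup rather than partway through training.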
Continue training from the saved checkpoint
python main.py --load_model ckpt_736.pth.tar
- Note that the `ckpt_736.pth.tar` file should be in the `pg_travel/mujoco/save_model` folder.
- Pass the arguments `algorithm` and/or `env` if not `PPO` and/or `Hopper-v2`.
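As a rough illustration of the folder requirement above, a loader might check that the checkpoint exists before trying to read it. The helper name `resolve_checkpoint` is hypothetical and is not part of this repository; the sketch only assumes the checkpoint name is looked up relative to a `save_model` folder.

```python
import os


def resolve_checkpoint(name, save_dir="save_model"):
    """Return the path of checkpoint `name` inside `save_dir`.

    Raises FileNotFoundError with a readable message if the file is absent.
    """
    path = os.path.join(save_dir, name)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{name} not found in {save_dir}")
    return path
```

Failing early with a clear message is friendlier than an error raised from deep inside the model-loading code.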
Test the pretrained model
Play 5 episodes with the saved model ckpt_736.pth.tar
python test_algo.py --load_model ckpt_736.pth.tar --iter 5
- Note that the `ckpt_736.pth.tar` file should be in the `pg_travel/mujoco/save_model` folder.
- Pass the `env` argument if not `Hopper-v2`.
Modify the hyperparameters
Hyperparameters are listed in hparams.py.
Change the hyperparameters according to your preference.
3. Tensorboard
We have integrated TensorboardX to observe training progress.
- Note that the training results are automatically saved in the `logs` folder.
- TensorboardX is a Tensorboard-like visualization tool for PyTorch.
Navigate to the pg_travel/mujoco folder
tensorboard --logdir logs
4. Trained Agent
We have trained the agents with four different PG algorithms using the Hopper-v2 env.
| Algorithm | Score | GIF |
|---|---|---|
| Vanilla PG | ![]() | ![]() |
| NPG | ![]() | ![]() |
| TRPO | ![]() | ![]() |
| PPO | ![]() | ![]() |
Unity ml-agents
1. Installation
2. Environments
We have modified the Walker environment provided by Unity ml-agents.
| Overview | image |
|---|---|
| Walker | ![]() |
| Plane Env | ![]() |
| Curved Env | ![]() |
Description
- 212 continuous observation spaces
- 39 continuous action spaces
- 16 walker agents in both Plane and Curved envs
- Reward
  - +0.03 times body velocity in the goal direction.
  - +0.01 times head y position.
  - +0.01 times body direction alignment with goal direction.
  - -0.01 times head velocity difference from body velocity.
  - +1000 for reaching the target
- Done
  - When body parts other than the right and left feet of the walker agent touch the ground or walls
  - When the walker agent reaches the target
- Contains Plane and Curved walker environments for Linux / Mac / Windows!
- Linux headless envs are also provided for faster training and server-side training.
- Download the corresponding environments, unzip, and put them in the `pg_travel/unity/env` folder.
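The reward terms listed in the description above can be read as a weighted sum. The sketch below is one plausible interpretation, not the Unity environment's actual code: the function and argument names are hypothetical, each input is assumed to be a scalar already computed by the environment, and the +1000 target bonus is assumed to be added on top of the shaping terms.

```python
def walker_reward(body_velocity_toward_goal, head_height,
                  direction_alignment, head_body_velocity_diff,
                  reached_target=False):
    """Combine the documented reward terms into a single scalar (assumed form)."""
    reward = (0.03 * body_velocity_toward_goal    # progress toward the goal
              + 0.01 * head_height                # head y position
              + 0.01 * direction_alignment        # facing the goal direction
              - 0.01 * head_body_velocity_diff)   # penalize head/body velocity mismatch
    if reached_target:
        reward += 1000.0                          # bonus for reaching the target
    return reward


# Example step: every term equal to 1.0, target not reached.
print(walker_reward(1.0, 1.0, 1.0, 1.0))
```

With these weights the per-step shaping terms are small next to the target bonus, so reaching the target dominates the return.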
3. Train
Navigate to the pg_travel/unity folder
Basic Usage
Train the walker agent with PPO using the Plane environment without rendering.
python main.py --train
- The PPO implementation is for multi-agent training. It collects experiences from multiple agents and uses them to train the global policy and value networks (brain). Refer to `pg_travel/mujoco/agent/ppo_gae.py` for single-agent training only.
- See the arguments in main.py. You can change the hyperparameters for the PPO algorithm, the network architecture, etc.
- Note that models are saved in the `save_model` folder automatically every 100th iteration.
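To illustrate the multi-agent idea described above, experiences gathered by each agent can be pooled into one batch before updating the shared networks. This is a minimal sketch in plain Python with a hypothetical `pool_experiences` helper and an assumed `(state, action, reward, done)` transition layout; the repository's own data structures may differ.

```python
def pool_experiences(per_agent_trajectories):
    """Flatten per-agent lists of (state, action, reward, done) into one batch.

    Transitions from the same agent stay contiguous and in time order, so
    per-agent returns can still be computed from the `dones` flags.
    """
    batch = {"states": [], "actions": [], "rewards": [], "dones": []}
    for trajectory in per_agent_trajectories:
        for state, action, reward, done in trajectory:
            batch["states"].append(state)
            batch["actions"].append(action)
            batch["rewards"].append(reward)
            batch["dones"].append(done)
    return batch


# Toy data: 16 agents, 3 steps each, with dummy states and actions.
trajectories = [
    [((agent, t), 0.0, 1.0, t == 2) for t in range(3)]
    for agent in range(16)
]
batch = pool_experiences(trajectories)
print(len(batch["states"]))
```

Pooling lets a single policy and value update use data from all walkers, which is the usual motivation for running several agents in one environment.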
Continue training from the saved checkpoint
python main.py --load_model ckpt_736.pth.tar --train
- Note that the `ckpt_736.pth.tar` file should be in the `pg_travel/unity/save_model` folder.
Test the pretrained model
python main.py --render --load_model ckpt_736.pth.tar
- Note that the `ckpt_736.pth.tar` file should be in the `pg_travel/unity/save_model` folder.
Modify the hyperparameters
See main.py for default hyperparameter settings.
Pass the hyperparameter arguments according to your preference.
4. Tensorboard
We have integrated TensorboardX to observe training progress.
Navigate to the pg_travel/unity folder
tensorboard --logdir logs
5. Trained Agent
We have trained the agents with PPO using plane and curved envs.
| Env | GIF |
|---|---|
| Plane | ![]() |
| Curved | ![]() |
Reference
We referenced code from the repositories below.












