Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control
This repository contains the code for:
Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control [arXiv]. Also appears in ICLR 2021 and the NeurIPS 2019 Deep RL Workshop.
Zhuang Liu*, Xuanlin Li*, Bingyi Kang* and Trevor Darrell (* equal contribution)
Our code is adapted from OpenAI Baselines and SAC.
Abstract
Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm designs. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find conventional regularization techniques on the policy networks can often bring large improvement, especially on harder tasks. Our findings are shown to be robust against training hyperparameter variations. We also compare these techniques with the more widely used entropy regularization. In addition, we study regularizing different components and find that only regularizing the policy network is typically the best. We further analyze why regularization may help generalization in RL from four perspectives - sample complexity, reward distribution, weight norm, and noise robustness. We hope our study provides guidance for future practices in regularizing policy optimization algorithms.
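To make the comparison in the abstract concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code; all variable names are made up) contrasting the two kinds of regularization for a diagonal Gaussian policy: a conventional L2 penalty acts on the policy network's weights, while entropy regularization adds a bonus computed from the action distribution itself.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 6))   # toy stand-in for the policy MLP weights
log_std = np.zeros(6)                # per-dimension log standard deviation
surrogate_loss = 1.23                # placeholder for the policy-gradient loss

# Conventional L2 regularization penalizes the parameters:
l2_coef = 1e-4
loss_l2 = surrogate_loss + l2_coef * np.sum(weights ** 2)

# Entropy regularization rewards a high-entropy action distribution;
# for a diagonal Gaussian, entropy = sum_i (log_std_i + 0.5 * log(2*pi*e)).
ent_coef = 0.01
entropy = np.sum(log_std + 0.5 * np.log(2 * np.pi * np.e))
loss_ent = surrogate_loss - ent_coef * entropy

print(loss_l2, loss_ent)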
Installation Instructions
Clone https://github.com/rll/rllab into PATH_TO_RLLAB_FOLDER.
Install MuJoCo (but don't install mujoco_py yet) by following the instructions at https://github.com/openai/mujoco-py.
Copy additional_lib_for_rllab/libglfw.so.3 and additional_lib_for_rllab/libmujoco131.so from this repository, along with mjkey.txt from your MuJoCo key path, into a new folder named PATH_TO_RLLAB_FOLDER/vendor/mujoco.
Fix a typo in rllab by editing PATH_TO_RLLAB_FOLDER/rllab/sampler/stateful_pool.py (e.g., with vi): change
"from joblib.pool import MemmapingPool"
to
"from joblib.pool import MemmappingPool"
Set up a virtual environment:
virtualenv ENV_NAME --python=python3
source ENV_NAME/bin/activate
Install mujoco_py for MuJoCo (version 2.0) by following the instructions at https://github.com/openai/mujoco-py.
Next, modify .bashrc (or set up a shell script named SOMESCRIPT.sh and source it before training) to include:
export PYTHONPATH=PATH_TO_THIS_REPO/baselines_release:$PYTHONPATH
export PYTHONPATH=PATH_TO_RLLAB_FOLDER:$PYTHONPATH
export PYTHONPATH=PATH_TO_THIS_REPO/sac_release:$PYTHONPATH
Next, install the required packages. OpenAI Baselines also requires CUDA >= 9.0. For tensorflow-gpu, choose a version that matches your CUDA installation (note that TensorFlow 2.0.0 is not compatible with this repo).
pip3 install tensorflow-gpu==VERSION_THAT_COMPLIES_WITH_CUDA_INSTALLATION
pip3 install mpi4py roboschool==1.0.48 gym==0.13.0 click dill joblib opencv-python progressbar2 tqdm theano path.py cached_property python-dateutil pyopengl mako gtimer matplotlib pyprind
pip3 install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
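As a quick sanity check after installation (my own suggestion, not part of the original instructions), the key packages should import cleanly inside the activated virtualenv:
import tensorflow as tf
import gym
import roboschool   # importing roboschool registers the Roboschool* environments
import mujoco_py    # requires a working MuJoCo installation and license key

print(tf.__version__)            # should be a 1.x release; 2.0.0 is not supported
print(gym.make("Humanoid-v2"))   # exercises the MuJoCo bindings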
Running
To train, run
cd PATH_TO_THIS_REPO
python -m baselines.run --help
python PATH_TO_THIS_REPO/sac_release/examples/mujoco_all_sac.py --help
to see the available arguments, such as the number of environments simulated in parallel, the model save path, etc.
For Soft Actor-Critic, PATH_TO_THIS_REPO/sac_release/examples/variants.py contains the default environment settings. These settings are overridden by command line arguments.
Regularization Options
l1regpi, l1regvf = L1 policy/value network regularization
l2regpi, l2regvf = L2 policy/value network regularization
wclippi, wclipvf = Policy/value network weight clipping
(Note: for OpenAI Baselines policy weight clipping, we only clip the MLP part of the network, because clipping the log standard deviation vector almost always harms performance; see the sketch after this list.)
dropoutpi, dropoutvf = Policy/value network dropout KEEP_PROB (1.0 = no dropout)
batchnormpi, batchnormvf = Policy/value network batch normalization (True or False)
ent_coef = Entropy regularization coefficient
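As a rough illustration of how the dropout and weight clipping options behave, here is a small NumPy sketch (my own, not the repo's TensorFlow implementation; all variable names are made up for the example):
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a Gaussian policy's parameters (illustrative names only).
mlp_weights = rng.normal(size=(64, 6))
log_std = np.zeros(6)

# dropoutpi/dropoutvf take a KEEP probability: 1.0 means no dropout,
# 0.9 means each activation is dropped with probability 0.1.
def dropout(x, keep_prob):
    if keep_prob >= 1.0:
        return x
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob  # inverted dropout keeps the expected value unchanged

activations = dropout(rng.normal(size=(32, 64)), keep_prob=0.9)

# wclippi/wclipvf constrain the weights to a fixed range; per the note above,
# only the MLP weights are clipped, not the log standard deviation vector.
clip_range = 0.5
mlp_weights = np.clip(mlp_weights, -clip_range, clip_range)

print(activations.std(), np.abs(mlp_weights).max(), log_std)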
Examples:
python -m baselines.run --alg=ppo2 --env=RoboschoolHumanoid-v1 --num_timesteps=5e7 --l2regpi=0.0001
Runs ppo2 (Proximal Policy Optimization) on the RoboschoolHumanoid task for 5e7 timesteps, with L2 regularization of strength 0.0001 applied to the policy network.
python -m baselines.run --alg=a2c --env=Humanoid-v2 --num_timesteps=2e7 --ent_coef=0.0 --batchnormpi=True
Runs a2c (the synchronous version of A3C) on the Humanoid (MuJoCo) task for 2e7 timesteps, with batch normalization applied to the policy network and entropy regularization turned off.
python sac_release/examples/mujoco_all_sac.py --env=atlas-forward-walk-roboschool --dropoutpi=0.9
Runs sac (Soft Actor-Critic) on the RoboschoolAtlasForwardWalk task with dropout applied to the policy network with keep probability 0.9 (i.e., dropout probability 1 - 0.9 = 0.1). Note that the number of training timesteps is predefined in sac_release/examples/variants.py.
Citation
@inproceedings{liu2020regularization,
  title={Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control},
  author={Liu, Zhuang and Li, Xuanlin and Kang, Bingyi and Darrell, Trevor},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=yr1mzrH3IC}
}