RewardShifting
Code for the NeurIPS 2022 paper "Exploit Reward Shifting in Value-Based Deep-RL: Optimistic Curiosity-Based Exploration and Conservative Exploitation via Linear Reward Shaping".
How To Design A Reward Function For Your Reinforcement Learning Task (In Value-Based RL)?
- To boost exploration, use a negative reward shift, so that the agent is encouraged to visit more unvisited state-action pairs.
- To boost exploitation, use a positive reward shift, so that the agent conservatively revisits state-action pairs it has already seen.
Our paper provides a detailed analysis of how reward design affects the learning process.
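To see why the sign of the shift matters, here is the standard discounted-return arithmetic behind this insight (a back-of-the-envelope sketch, not a quotation of the paper's analysis):

```latex
% Shifting every reward by a constant c shifts the optimal action-value
% function by c / (1 - \gamma) (infinite-horizon, discount factor \gamma):
Q^*_{r+c}(s,a)
  = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r_t + c\big)\Big]
  = Q^*_{r}(s,a) + \frac{c}{1-\gamma}.
% With c < 0, a Q-network initialized near zero overestimates the shifted
% values (optimistic => exploration); with c > 0, it underestimates them
% (pessimistic => conservative exploitation).
```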
This repo is related to the following topics:
- Reward Design in Deep RL
- Reward Design for Better Exploration
- Ensemble in Deep Reinforcement Learning
- Diversity Boosting in Q-Value Network Ensemble
- Offline RL (conservatism via reward shifting)
- Value-Based Deep-RL
Let us Exploit Reward Shifting in Value-Based Deep-RL!
Project Page
Key Insight: A positive reward shift leads to conservative exploitation, and a negative reward shift leads to curiosity-driven exploration.
Reproduction & Basic Usage:
To reproduce our results, please follow the instructions in each folder. In fact, the easiest way to get started is to play with reward shifting yourself!
In your own value-based DRL tasks, simply add one line right after the interaction with your environment, e.g.,
next_s, r, done, info = env.step(a)
r = r + args.shifting_constant
Don't forget to remove the shift when evaluating your policy :)
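Below is a minimal sketch (not code from this repo) of how one might wrap the shift so that it only affects training, assuming the classic `gym` API used in the snippet above; the environment name and the `shift` argument are placeholders you would replace with your own.

```python
import gym


class ShiftedRewardWrapper(gym.RewardWrapper):
    """Adds a constant to every reward the agent observes during training."""

    def __init__(self, env, shift):
        super().__init__(env)
        self.shift = shift  # the shifting constant c (hypothetical argument)

    def reward(self, r):
        # Apply the constant shift to the environment reward.
        return r + self.shift


# The training env sees shifted rewards; the evaluation env keeps the
# original reward, so the shift never leaks into evaluation.
train_env = ShiftedRewardWrapper(gym.make("CartPole-v1"), shift=-1.0)
eval_env = gym.make("CartPole-v1")
```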
Potential Ideas
Here are several potential extensions of our work:
- Theoretically, guidance on choosing the shifting constant.
- Methodologically, the choice of ensemble bias values.
- Empirically, combining the upper and lower bounds (as a non-linear combination) with Thompson sampling for better exploration.
- Other linear reward shaping, e.g., with a non-trivial scaling factor k (see the sketch after this list).
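As an illustration of the last item, the one-liner above generalizes to a full linear shaping r' = k * r + b; `args.scale` and `args.shift` are hypothetical hyper-parameters here, and how to choose k is exactly the open question.

```python
# Generalized linear reward shaping (sketch): r' = k * r + b.
# `args.scale` (k) and `args.shift` (b) are hypothetical hyper-parameters.
next_s, r, done, info = env.step(a)
r = args.scale * r + args.shift
```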
BibTeX
@article{sun2022exploit,
title={Exploit Reward Shifting in Value-Based Deep-RL: Optimistic Curiosity-Based Exploration and Conservative Exploitation via Linear Reward Shaping},
author={Sun, Hao and Han, Lei and Yang, Rui and Ma, Xiaoteng and Guo, Jian and Zhou, Bolei},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={37719--37734},
year={2022}
}