FinRL-Meta
[Suggestion] Reward helper / env setting
It's really important not to overlook the reward part. Using only the return as the reward is probably the reason most RL trading attempts fail. This might be far more essential than the agents themselves: "The reward fed to the RL agent is completely governing its behavior, so a wise choice of the reward shaping function is critical for good performance. There are quite a number of rewards one can choose from or combine, from risk-based measures, to profitability or cumulative return, number of trades per interval, etc. The RL framework accepts any sort of rewards, the denser the better."
A great paper giving a nice overview of different reward functions: https://arxiv.org/abs/2004.06985
Chapter 4 (Reward Functions) covers:
- PnL-based Rewards (Unrealized PnL, Unrealized PnL with Realized Fills, Asymmetrical Unrealized PnL with Realized Fills, Asymmetrical Unrealized PnL with Realized Fills and Ceiling, Realized PnL Change)
- Goal-based Rewards (Trade Completion)
- Risk-based Rewards (Differential Sharpe Ratio)
This is a great example of an environment that uses a parameter for reward selection: https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/gym_trading/envs/base_environment.py The code for the reward functions it uses: https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/gym_trading/utils/reward.py It would be a great improvement over that example if one could also easily combine multiple reward functions into one; a rough sketch of what that could look like follows below.
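Something like the following could serve as a starting point for such a combinable reward helper. The `RewardHelper` class, the registry, and the `info` keys are hypothetical placeholders of mine, not the crypto-rl or FinRL-Meta API:

```python
# Hypothetical sketch: a registry of reward functions plus a helper that blends
# them with weights. None of these names come from crypto-rl or FinRL-Meta.
from typing import Callable, Dict


def realized_pnl_change(info: dict) -> float:
    """Change in realized PnL since the previous step (a dense PnL-based reward)."""
    return info["realized_pnl"] - info["prev_realized_pnl"]


def unrealized_pnl(info: dict) -> float:
    """Mark-to-market PnL of the currently open position."""
    return info["position"] * (info["price"] - info["entry_price"])


REWARD_REGISTRY: Dict[str, Callable[[dict], float]] = {
    "realized_pnl_change": realized_pnl_change,
    "unrealized_pnl": unrealized_pnl,
}


class RewardHelper:
    """Combine one or more registered reward functions into a single scalar."""

    def __init__(self, weights: Dict[str, float]):
        self.weights = weights

    def __call__(self, info: dict) -> float:
        return sum(weight * REWARD_REGISTRY[name](info)
                   for name, weight in self.weights.items())
```

The environment could then expose something like `reward_fn = RewardHelper({"realized_pnl_change": 1.0, "unrealized_pnl": 0.1})` and call `reward_fn(step_info)` inside `step()`.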
More notes from my research on reward functions:
- Consider both the immediate and the long-term reward. See Section 2.1.3 in https://arxiv.org/ftp/arxiv/papers/1907/1907.04373.pdf: "In our FMDP, we utilize both the immediate and long-term reward."
- Differential Sharpe Ratio (a step-by-step, per-period Sharpe ratio reward) - https://proceedings.neurips.cc/paper/1998/file/4e6cd95227cb0c280e99a195be5f6615-Paper.pdf (a sketch follows after this list)
- Deflated Sharpe Ratio - M. López de Prado - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2465675 - The Deflated Sharpe Ratio (DSR) corrects for two leading sources of performance inflation: non-normally distributed returns and selection bias under multiple testing (a sketch follows after this list).
- Smart Sharpe: https://www.keyquant.com/Download/GetFile?Filename=%5CPublications%5CKeyQuant_WhitePaper_APT_Part2.pdf - the superior predictive power of the Smart Sharpe Ratio offers much better drawdown control while preserving the Sharpe Ratio.
- Penalize holding a position for too long.
- Target an optimal total number of trades (a rough estimate of how often the model should trade; a 1D timeframe and a 5m timeframe call for different trading frequencies).
- Reward relative to buy-and-hold / penalize the agent if it does not beat the buy-and-hold return of the day (a sketch combining this and the two previous points follows below).
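For the Differential Sharpe Ratio, here is a minimal sketch of the incremental update from Moody & Saffell (1998), usable as a dense per-step reward. The class name, the default `eta`, and the numerical guard are my own choices, not FinRL-Meta code:

```python
# Sketch of the Differential Sharpe Ratio (Moody & Saffell, 1998) as a
# per-step reward. Names and defaults are illustrative, not FinRL-Meta API.
class DifferentialSharpeRatio:
    """Incrementally updated, Sharpe-ratio-based reward for each step return R_t."""

    def __init__(self, eta: float = 0.01):
        self.eta = eta   # decay rate of the exponential moving averages
        self.A = 0.0     # EMA of returns (first moment)
        self.B = 0.0     # EMA of squared returns (second moment)

    def step(self, r: float) -> float:
        delta_a = r - self.A
        delta_b = r ** 2 - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        # Guard against the degenerate start-up case where the variance is ~0.
        dsr = 0.0 if denom < 1e-12 else (self.B * delta_a - 0.5 * self.A * delta_b) / denom
        # Update the moving averages after computing the reward.
        self.A += self.eta * delta_a
        self.B += self.eta * delta_b
        return dsr
```

The design point is that `eta` controls how quickly the first and second moments adapt, so the reward stays dense at every step while still tracking a risk-adjusted objective.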
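For the Deflated Sharpe Ratio, a rough sketch written from the formulas in Bailey & López de Prado; it is an evaluation statistic rather than a per-step reward, so it fits better as an episode-level or strategy-selection score. Variable names are mine, and the formulas should be double-checked against the paper:

```python
# Rough sketch of the Deflated Sharpe Ratio; names are mine, and the formulas
# should be verified against Bailey & Lopez de Prado before relying on them.
import numpy as np
from scipy.stats import norm


def expected_max_sharpe(sr_variance: float, n_trials: int) -> float:
    """Expected maximum Sharpe ratio across n_trials unskilled strategies."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    return np.sqrt(sr_variance) * (
        (1 - gamma) * norm.ppf(1 - 1 / n_trials)
        + gamma * norm.ppf(1 - 1 / (n_trials * np.e))
    )


def deflated_sharpe_ratio(observed_sr: float, benchmark_sr: float,
                          n_obs: int, skew: float, kurt: float) -> float:
    """Probability that the observed (per-period) Sharpe ratio exceeds the
    benchmark SR, accounting for non-normal returns and sample length.
    `kurt` is raw (Pearson) kurtosis, i.e. 3 for a normal distribution."""
    numerator = (observed_sr - benchmark_sr) * np.sqrt(n_obs - 1)
    denominator = np.sqrt(1 - skew * observed_sr + (kurt - 1) / 4 * observed_sr ** 2)
    return norm.cdf(numerator / denominator)
```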
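And a sketch combining the last three bullets (holding penalty, target trade count, buy-and-hold benchmark) into one shaping function; every name, flag, and scale factor here is a placeholder to be tuned per timeframe:

```python
# Hypothetical shaping terms; none of this is FinRL-Meta API, and the penalty
# scales are arbitrary starting values.
def shaped_reward(step_return: float,
                  bars_held: int,
                  max_hold_bars: int,
                  trades_so_far: int,
                  target_trades: int,
                  strategy_return: float,
                  buy_and_hold_return: float,
                  done: bool,
                  hold_penalty: float = 1e-3,
                  trade_penalty: float = 1e-2) -> float:
    reward = step_return
    # 1) Penalize positions held longer than a chosen horizon.
    if bars_held > max_hold_bars:
        reward -= hold_penalty * (bars_held - max_hold_bars)
    if done:
        # 2) Penalize deviating from a rough target number of trades for the
        #    episode (the target should differ between 1D and 5m timeframes).
        reward -= trade_penalty * abs(trades_so_far - target_trades) / max(target_trades, 1)
        # 3) Add only the excess over the day's buy-and-hold return.
        reward += strategy_return - buy_and_hold_return
    return reward
```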
Thanks for your valuable suggestions. We will test these reward functions and try to add support for them in a future version.
Is there support for other reward functions besides return?