
[Suggestion] Reward helper / env setting

Open cryptocoinserver opened this issue 3 years ago • 2 comments

It's really important not to overlook the reward part. Using raw return alone is probably the reason most attempts at RL trading fail. The reward design may be far more essential than the choice of agent: "The reward fed to the RL agent is completely governing its behavior, so a wise choice of the reward shaping function is critical for good performance. There are quite a number of rewards one can choose from or combine, from risk-based measures, to profitability or cumulative return, number of trades per interval, etc. The RL framework accepts any sort of rewards, the denser the better."

A great paper giving a nice overview of different reward functions: https://arxiv.org/abs/2004.06985

Chapter 4 (Reward Functions) covers:

  • PnL-based Rewards (Unrealized PnL, Unrealized PnL with Realized Fills, Asymmetrical Unrealized PnL with Realized Fills, Asymmetrical Unrealized PnL with Realized Fills and Ceiling, Realized PnL Change)
  • Goal-based Rewards (Trade Completion)
  • Risk-based Rewards (Differential Sharpe Ratio)
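For reference, the Differential Sharpe Ratio above can be computed incrementally, which makes it usable as a dense per-step reward. A minimal sketch following Moody & Saffell (1998); the class name, the default eta, and the variance guard are my choices, not from the paper:

```python
class DifferentialSharpe:
    """Differential Sharpe Ratio reward (Moody & Saffell, 1998).

    Keeps exponential moving estimates of the first and second moments
    of returns and emits a per-step reward approximating the marginal
    change in the Sharpe ratio.
    """

    def __init__(self, eta: float = 0.01):
        self.eta = eta  # adaptation rate of the moving moment estimates
        self.A = 0.0    # EMA of returns (first moment)
        self.B = 0.0    # EMA of squared returns (second moment)

    def step(self, r: float) -> float:
        dA = r - self.A
        dB = r * r - self.B
        var = self.B - self.A ** 2
        # Guard the first steps, where the variance estimate is ~0.
        dsr = 0.0 if var <= 1e-12 else (self.B * dA - 0.5 * self.A * dB) / var ** 1.5
        self.A += self.eta * dA
        self.B += self.eta * dB
        return dsr
```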

This is a great example of an environment that uses a parameter for reward selection: https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/gym_trading/envs/base_environment.py

The code for the reward functions it uses: https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/gym_trading/utils/reward.py

It would be a great improvement over that example if one could also easily combine multiple reward functions into one; a sketch of such a combination helper follows.
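A minimal sketch of a reward registry with weighted combination. Everything here is hypothetical: the component names, the `state` dict layout, and the weights are illustrations, not part of FinRL-Meta or crypto-rl:

```python
from typing import Callable, Dict

# Hypothetical reward components keyed by name. The `state` dict keys
# (nav, prev_nav, closed_profitable_trade, bars_in_position) are assumed
# to be filled in by the environment each step.
REWARDS: Dict[str, Callable[[dict], float]] = {
    "pnl_change": lambda s: s["nav"] - s["prev_nav"],
    "trade_completion": lambda s: 1.0 if s["closed_profitable_trade"] else 0.0,
    "holding_penalty": lambda s: -0.001 * s["bars_in_position"],
}

def combined_reward(state: dict, weights: Dict[str, float]) -> float:
    """Weighted sum of the selected reward components."""
    return sum(w * REWARDS[name](state) for name, w in weights.items())

# Usage: mostly PnL, plus a small holding penalty.
# reward = combined_reward(state, {"pnl_change": 1.0, "holding_penalty": 0.2})
```

An environment could then take a `reward_weights` mapping as a constructor parameter, generalizing the single reward-type string used in the crypto-rl example.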

More notes from my research on rewards:

  • Consider both immediate and long-term reward. See section 2.1.3 in https://arxiv.org/ftp/arxiv/papers/1907/1907.04373.pdf: "In our FMDP, we utilize both the immediate and long-term reward."
  • Differential Sharpe Ratio (Sharpe ratio reward computed step by step, sketched above) - https://proceedings.neurips.cc/paper/1998/file/4e6cd95227cb0c280e99a195be5f6615-Paper.pdf
  • Deflated Sharpe Ratio - M. López de Prado - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2465675 - "The Deflated Sharpe Ratio (DSR) corrects for two leading sources of performance inflation: non-normally distributed returns, and selection bias under multiple testing."
  • Smart Sharpe - https://www.keyquant.com/Download/GetFile?Filename=%5CPublications%5CKeyQuant_WhitePaper_APT_Part2.pdf - "The superior predictive power of the Smart Sharpe Ratio offers much better drawdown control while preserving the Sharpe Ratio."
  • Penalize excessively long holding periods.
  • Target an optimal total number of trades (a rough estimate of how often the model should trade; a 1D timeframe and a 5m timeframe call for very different trade frequencies).
  • Reward relative to buy-and-hold / penalize the agent if it doesn't beat the day's buy-and-hold return. (The last three shaping ideas are sketched in code after this list.)
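A minimal sketch of those three shaping terms in one function. All coefficients, parameter names, and the linear penalty forms are assumptions for illustration, not a tested recipe:

```python
def shaped_reward(step_return: float,
                  benchmark_return: float,
                  bars_in_position: int,
                  trades_so_far: int,
                  target_trades: int,
                  hold_penalty: float = 1e-4,
                  freq_penalty: float = 1e-3) -> float:
    """Combine an excess-over-benchmark reward with holding and frequency penalties."""
    # Reward excess over buy-and-hold rather than raw return.
    excess = step_return - benchmark_return
    # Penalize positions held for too many bars.
    holding = hold_penalty * max(0, bars_in_position - 1)
    # Penalize deviation from a rough target trade count for the timeframe.
    frequency = freq_penalty * abs(trades_so_far - target_trades) / max(target_trades, 1)
    return excess - holding - frequency
```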

cryptocoinserver · Dec 04 '21 16:12

Thanks for your valuable suggestions. We will test these reward functions and try to add support for them in a future version.

rayrui312 · Dec 04 '21 17:12

Is there support for other reward functions besides return?

ildefons · May 05 '23 14:05