
[Feature Request] [V1.1] Standardization of intrinsic rewards


🚀 Feature

Standardization of intrinsic rewards, and a single parameter to turn off extrinsic rewards

Motivation

Historically, rewards have been modeled as strictly extrinsic, per the theoretical foundations of the MDP optimization methods that underpin RL. People have always discussed changes to the reward function, whether through reward shaping or through learning the reward function with apprenticeship learning and the like. However, with the increased interest in intrinsic rewards, "self-supervised" RL, "unsupervised" RL, and curiosity, which are all closely related subareas, a standardized way to specify intrinsic rewards would be beneficial.

Pitch

If Stable-Baselines3 standardized the way in which intrinsic rewards are computed and combined with extrinsic rewards, the immediate effect would be that reward shaping is shifted to the agent, where it belongs. Additionally, this would make future implementations of these types of algorithms far easier to add, as there would be a standard format. It may also increase exposure and usage of SB3 in real-world applications, because this treatment of these popular algorithms reflects what occurs in most real-world control settings: extrinsic rewards are few and far between, and most of the time the reward or cost function designed or learned by the practitioner is specific to the agent.

Optionally, per the interest of the researchers in this area, one should also be able to fully switch off the extrinsic rewards from the environment, i.e. have them always return 0 (I know full well that you can simply not use the rewards you get from the environment, but this seems... cleaner).

To do this, there would have to be agreement on the following:

  • where within the learn/train cycle this occurs
  • how to template the input and output of the intrinsic reward function so that it applies to all SB3 algorithms
  • the cost in time, debugging effort, and realism of wrapping Gym environments so that the extrinsic reward can be switched off (a rough sketch of such a wrapper follows this list)
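For illustration only, switching off the extrinsic reward could be as simple as a plain Gym RewardWrapper. This is a rough sketch under that assumption, not a proposed SB3 API, and the class name is made up:

```python
import gym


class ZeroExtrinsicReward(gym.RewardWrapper):
    """Sketch: always report an extrinsic reward of 0, so that only
    intrinsic rewards (added by the agent elsewhere) drive learning."""

    def reward(self, reward):
        # Discard whatever the environment returned.
        return 0.0


# Hypothetical usage:
# env = ZeroExtrinsicReward(gym.make("CartPole-v1"))
```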

Alternatives

The alternative is to do nothing, which is fine, since this is a minor feature.

Additional context

None

Checklist

  • [ ] I have checked that there is no similar issue in the repo, though it is related to #538; however, that issue is neither formatted correctly nor elaborated on.
  • [ ] Decide on location of intrinsic reward and its format (see Pitch)
  • [ ] Decide whether there should exist an 'exploration abstraction template' of which 'intrinsic reward' is simply a child implementation, so that other, non-reward-related intrinsic motivation concepts can all be handled similarly (my instinct says we should not do this, as it would be trying to do too much)
  • [ ] Implement the intrinsic reward template and add it to all current algorithms (a hypothetical sketch of such a template follows this checklist)
  • [ ] Update documentation
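To make the 'intrinsic reward template' items above more concrete, here is one hypothetical shape such a template could take. Nothing below is an existing SB3 interface; the class name, method, and signature are assumptions only:

```python
from abc import ABC, abstractmethod

import numpy as np


class IntrinsicReward(ABC):
    """Hypothetical template: a standardized component that any SB3
    algorithm could query for a per-environment intrinsic bonus."""

    @abstractmethod
    def compute(
        self, obs: np.ndarray, actions: np.ndarray, next_obs: np.ndarray
    ) -> np.ndarray:
        """Return an array of intrinsic bonuses, one per (vectorized) env."""
        ...
```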

Please let me know if there is anything else we need to add to this checklist! I don't have time to contribute directly to the implementation until after IJCAI, but I wanted to start the dialogue now in anticipation of CoRL, IROS, RLDM, ECCV, and NeurIPS, not to mention all of the intrinsic motivation papers that have come or will come out of AAAI, ICLR, and CVPR.

balloch avatar Nov 16 '21 18:11 balloch

Hello, thanks for the proposal. Correct me if I'm wrong, but intrinsic rewards are simply added (with some scale factor) to the reward of the environment? In that case, providing Gym wrappers (or a VecEnvWrapper) would be the way to go? (see https://github.com/hill-a/stable-baselines/issues/309)
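For example, a minimal sketch of such a VecEnvWrapper, assuming the intrinsic bonus comes from some user-supplied callable (the intrinsic_fn name and its signature are assumptions, not an existing SB3 API):

```python
import numpy as np

from stable_baselines3.common.vec_env import VecEnvWrapper


class IntrinsicRewardVecWrapper(VecEnvWrapper):
    """Sketch: add a scaled intrinsic bonus to the extrinsic reward
    returned by the wrapped VecEnv."""

    def __init__(self, venv, intrinsic_fn, scale=0.01):
        super().__init__(venv)
        self.intrinsic_fn = intrinsic_fn  # e.g. a curiosity/novelty model
        self.scale = scale
        self._last_obs = None

    def reset(self):
        self._last_obs = self.venv.reset()
        return self._last_obs

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        # Bonus computed from the transition, then combined with the
        # extrinsic reward via a simple scale factor, as discussed above.
        bonus = np.asarray(self.intrinsic_fn(self._last_obs, obs))
        self._last_obs = obs
        return obs, rewards + self.scale * bonus, dones, infos
```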

If so, I'll move this issue to the contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib It is made for such feature requests (please read the SB3 Contrib contribution guide too).

araffin avatar Nov 16 '21 21:11 araffin

closing as outside the scope of SB3

araffin avatar Sep 14 '22 20:09 araffin