RND with extra value heads
Why are these changes needed?
Exploration algorithms become more important as we move to more complex environments with difficult reward functions (e.g. sparse ones). Some exploration algorithms have shown promising results on the baselines, among them Curiosity, which is already implemented in RLlib. RND (Random Network Distillation) is another such algorithm. It improves upon curiosity-driven approaches by explicitly addressing (1) the Noisy-TV problem, in which a curious agent becomes trapped by its own curiosity (it finds a source of randomness and keeps observing it), and (2) performance, resource usage, and simplicity, as it trains only a single prediction network for distillation. Because distillation itself works very reliably, the algorithm also benefits from high stability.
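To make the distillation idea concrete, here is a minimal sketch in plain Python. It is not the code of this PR: the class name `RNDSketch`, the single tanh layer, and all hyperparameters are assumptions chosen for illustration. A frozen random target network embeds each observation, a predictor is trained to match it, and the remaining prediction error serves as the intrinsic reward.

```python
import math
import random


class RNDSketch:
    """Toy RND module (hypothetical, not the PR's implementation).

    Both "networks" are a single tanh layer; real setups use deeper nets.
    """

    def __init__(self, obs_dim, embed_dim=8, lr=1.0, seed=0):
        rng = random.Random(seed)

        def init_weights():
            return [[rng.gauss(0.0, 0.5) for _ in range(obs_dim)]
                    for _ in range(embed_dim)]

        self.target = init_weights()     # fixed forever, never trained
        self.predictor = init_weights()  # distilled towards the target
        self.lr = lr

    @staticmethod
    def _forward(weights, obs):
        return [math.tanh(sum(w * x for w, x in zip(row, obs)))
                for row in weights]

    def intrinsic_reward(self, obs):
        """Novelty = mean squared prediction error on this observation."""
        t = self._forward(self.target, obs)
        p = self._forward(self.predictor, obs)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, obs):
        """One gradient step on the distillation loss for one observation."""
        t = self._forward(self.target, obs)
        p = self._forward(self.predictor, obs)
        n = len(t)
        for i, row in enumerate(self.predictor):
            # Gradient of mean((tanh(W x) - t)^2) w.r.t. row i, up to x_j.
            g = 2.0 / n * (p[i] - t[i]) * (1.0 - p[i] ** 2)
            for j, x in enumerate(obs):
                row[j] -= self.lr * g * x
```

Calling `update()` repeatedly on the same observation drives its intrinsic reward towards zero, while observations the predictor has not been trained on keep a larger error. Only the predictor is ever trained, which is why a single prediction network suffices.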
This PR intends to implement RND in a user-friendly way, meaning that it can simply be "plugged in" instead of requiring a subclassed Policy (more about this below). This does not come without implementation challenges, as RND also introduces an elegant method to flexibly combine intrinsic and extrinsic rewards. It does so by attaching a second value head to the policy network; this head's value targets are generated by applying GAE to the intrinsic rewards in a non-episodic manner. With this approach intrinsic and extrinsic rewards can (and should) be discounted differently, and the second value loss and the intrinsic advantages both influence policy training.
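The two reward streams can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function names `gae` and `combined_advantages` and the default mixing coefficients are hypothetical. The only difference between the two streams is whether episode ends cut off bootstrapping.

```python
def gae(rewards, values, last_value, dones, gamma, lam, episodic):
    """Generalized Advantage Estimation over one rollout fragment.

    If `episodic` is False, `dones` is ignored, so bootstrapping continues
    across episode boundaries (the non-episodic intrinsic stream).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if (episodic and dones[t]) else 1.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
        next_value = values[t]
    # Value targets for the corresponding value head.
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


def combined_advantages(adv_ext, adv_int, ext_coeff=2.0, int_coeff=1.0):
    """Weighted sum of both advantage streams (coefficients are assumptions)."""
    return [ext_coeff * e + int_coeff * i for e, i in zip(adv_ext, adv_int)]
```

In use, `gae` would be called once with the extrinsic rewards, the first value head's predictions, and `episodic=True`, and once with the intrinsic rewards, the second value head's predictions, and `episodic=False`. Each call can take its own `gamma` (for example 0.999 and 0.99; these numbers are illustrative and not taken from this PR). Each value head is regressed on its own targets, and the policy loss uses the combined advantages.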
RND belongs to the class of novelty-based exploration algorithms, which is quite large. A second member of this class is NovelD, which followed RND shortly afterwards. NovelD builds directly upon RND: it uses the same network distillation and the resulting novelty to derive its own novelty values. This motivates this PR to lay a foundation for this class of exploration algorithms in RLlib. Other novelty algorithms could then inherit the complex setup from RND and add their own specific novelty computations on top of it.
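A rough sketch of how such a shared basis might look is given below. All class names are hypothetical and none of this is RLlib API. The NovelD rule shown is my reading of its criterion (the clipped novelty difference between successive states, scaled by a factor `alpha`), and it leaves out NovelD's episodic visit-count restriction.

```python
from abc import ABC, abstractmethod


class NoveltyReward(ABC):
    """Hypothetical base class: owns the novelty estimator (e.g. RND error)."""

    def __init__(self, novelty_fn):
        # novelty_fn maps an observation to a non-negative novelty score.
        self.novelty_fn = novelty_fn

    @abstractmethod
    def intrinsic_reward(self, obs, next_obs):
        """Turn novelty scores into an intrinsic reward for one transition."""


class RNDReward(NoveltyReward):
    def intrinsic_reward(self, obs, next_obs):
        # RND rewards the novelty of the state that was reached.
        return self.novelty_fn(next_obs)


class NovelDReward(NoveltyReward):
    def __init__(self, novelty_fn, alpha=0.5):
        super().__init__(novelty_fn)
        self.alpha = alpha  # assumed scaling factor

    def intrinsic_reward(self, obs, next_obs):
        # Reward only moves from less novel to more novel states.
        diff = self.novelty_fn(next_obs) - self.alpha * self.novelty_fn(obs)
        return max(diff, 0.0)
```

The point of the split is that the distillation networks, the second value head, and the syncing logic live in one place, and a new algorithm only overrides the reward computation.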
As mentioned above, implementing RND as derived from the authors' original code did not come without challenges in RLlib:
- It needs a second value head attached to the policy model.
- It needs GAE applied to the intrinsic rewards.
- It needs an intrinsic, non-episodic value loss added to the policy loss.
- To conform with RLlib's computing standards, it needs a parallelized implementation, so that the advantage of exploring the environment faster does not come at the expense of the high performance of parallel rollouts.
These challenges are addressed as follows:
- The second value head is attached to the policy network in the `__init__()` of the exploration module.
- GAE on the intrinsic rewards uses the same calculations as for PPO, in the `postprocess_trajectory()` of the exploration module.
- The intrinsic value loss is calculated in a new `Exploration` method named `compute_loss_and_update()` that is then overridden. The loss is added to the policy loss using the code changes proposed in PR #26292.
- Parallelization also relies on `compute_loss_and_update()`, which can train the distillation network in the same iteration in which the policy's model is trained; the result is then synced via the `worker_set` (syncing is also implemented in #26292). With this updating mechanism the exploration algorithm can be used in parallel, and it appears that parallel execution even supports better exploration.
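The train-then-sync pattern from the last point can be pictured with the toy sketch below. It deliberately avoids RLlib classes: `ToyWorker` and `train_and_sync` are hypothetical stand-ins, and the actual signatures of `compute_loss_and_update()` and the syncing code in #26292 are not reproduced here.

```python
import copy


class ToyWorker:
    """Stand-in for a rollout worker holding its own predictor copy."""

    def __init__(self, weights):
        self.predictor_weights = copy.deepcopy(weights)

    def set_predictor_weights(self, weights):
        # Copy so that workers never share mutable state.
        self.predictor_weights = copy.deepcopy(weights)


def train_and_sync(local_worker, remote_workers, gradient, lr=0.1):
    """Update the predictor on the local worker, then broadcast the result."""
    local_worker.predictor_weights = [
        w - lr * g for w, g in zip(local_worker.predictor_weights, gradient)
    ]
    for worker in remote_workers:
        worker.set_predictor_weights(local_worker.predictor_weights)
```

The idea is that only the local worker trains the distillation network, once per training iteration, and every rollout worker then computes intrinsic rewards with the same, freshly synced predictor.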
What is still open
- I need to adjust the second value head to also work with all RLlib models (CNN, LSTM, Attention, Complex)
- I need to adjust the second value head to also work with plain keras.Model
- I need to adjust the algorithm to also work with multi-agent settings.
- Last, I want to check whether there are other algorithms in RLlib that work similarly to PPO, i.e. that use a value head and GAE.
Related PR number
#26292
Checks
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!