[RLlib] Stashed policies are being accessed excessively, defeating the purpose of a policy cache
What happened + What you expected to happen
The idea of a policy cache is to stash unused policies on disk or in the object store to alleviate memory pressure. That requires our code to only ever access policies that are currently held in the cache (i.e. in memory), and to properly restore state when a policy is un-stashed. If we blindly access already stashed policies, useful policies get stashed and then immediately un-stashed, which slows things down significantly and unnecessarily. While debugging, I noticed at least the following places where we access ALL policies, regardless of whether they are in the cache or not:
- Syncing weights from all policies to the eval workers. This will cause the local and eval workers to stash and then un-stash all policies. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/algorithm.py#L816-L826
- RolloutWorker seems to set global vars on all policies, regardless of whether they are stashed. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/evaluation/rollout_worker.py#L1781-L1782
- The APPO target network update works on all trainable policies. This will cause excessive policy restoring on the training workers. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/appo/appo.py#L239-L242
These are the places I have discovered so far. Things do seem a lot quieter if I comment out all of this logic. A sketch of the guarded access pattern I would expect instead is below.
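This is a minimal sketch only; `PolicyMapSketch` and `in_memory_policy_ids()` are made-up names for illustration, not real RLlib APIs. The point is simply that call sites like the ones above should iterate over the policies that are already in memory rather than over every known policy ID:

```python
from typing import Dict, Iterable


class PolicyMapSketch:
    """Toy stand-in for a policy map with a bounded in-memory cache."""

    def __init__(self, policies: Dict[str, object], capacity: int = 2):
        self._all = policies
        # Only the first `capacity` policies start out in memory.
        self._in_memory = dict(list(policies.items())[:capacity])

    def in_memory_policy_ids(self) -> Iterable[str]:
        # Policies that can be touched without triggering a restore.
        return list(self._in_memory.keys())

    def __getitem__(self, policy_id: str) -> object:
        # Touching a stashed policy would force an expensive un-stash here.
        return self._all[policy_id]


def sync_weights_guarded(policy_map: PolicyMapSketch) -> Dict[str, object]:
    """Collect weights only from policies that are already in memory."""
    return {pid: policy_map[pid] for pid in policy_map.in_memory_policy_ids()}


if __name__ == "__main__":
    pmap = PolicyMapSketch({f"policy_{i}": object() for i in range(100)})
    # Only the 2 in-memory policies are synced; the other 98 stay stashed.
    print(sorted(sync_weights_guarded(pmap)))
```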
Versions / Dependencies
Master
Reproduction script
Add logging lines where we stash and restore policies, then run:
bazel run rllib/learning_tests_multi_agent_cartpole_w_100_policies_appo
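The logging I mean is nothing more than something like the following (a minimal sketch; the function names and insertion points are hypothetical, to be dropped into whatever code paths actually stash and restore policies):

```python
import logging

logger = logging.getLogger("policy_cache_debug")


def log_stash(policy_id: str) -> None:
    # Call from the code path that stashes a policy to disk / object store.
    logger.warning("Stashing policy %s", policy_id)


def log_restore(policy_id: str) -> None:
    # Call from the code path that un-stashes (restores) a policy.
    logger.warning("Restoring policy %s", policy_id)
```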
Issue Severity
Medium: It is a significant difficulty but I can work around it.