
[Tune] [PBT] Maintain consistent `Trial`/`TrialRunner` state when pausing and resuming trial

Open · justinvyu opened this pull request 3 years ago · 0 comments

Why are these changes needed?

The problem

  • When running synchronous PBT with a checkpoint taken every time a perturbation happens, the experiment can reach a state where trial A is RUNNING but hangs forever without ever performing another training step, while trial B is PAUSED, waiting for A to reach the specified perturbation_interval. (A minimal sketch of this kind of setup is shown below.)
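
For context, a setup along these lines can trigger the hang. This is an illustrative sketch only, not the reproduction script from the PR: the trainable, metric, and hyperparameter names are placeholders, and the intervals are arbitrary.

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class MyTrainable(tune.Trainable):  # placeholder trainable
    def setup(self, config):
        self.lr = config["lr"]

    def step(self):
        # Report a dummy metric; Tune adds training_iteration to the result itself.
        return {"mean_accuracy": self.lr * self.iteration}

    def save_checkpoint(self, checkpoint_dir):
        return {"iteration": self.iteration}

    def load_checkpoint(self, checkpoint):
        pass


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,
    synch=True,  # synchronous PBT: trials wait for each other at each boundary
    hyperparam_mutations={"lr": tune.uniform(0.01, 0.1)},
)

tune.run(
    MyTrainable,
    num_samples=2,
    metric="mean_accuracy",
    mode="max",
    scheduler=pbt,
    checkpoint_freq=2,  # checkpoint at every perturbation boundary
    stop={"training_iteration": 10},
    config={"lr": tune.uniform(0.01, 0.1)},
)
```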

Why does this happen?

  1. Synchronous PBT waits for the last trial's result to come in before performing the exploit step for all trials.
  2. PBT can call TrialExecutor.stop_trial(trial) within PBT._exploit before one of the other trials has finished saving (trial.is_saving is still True, and a decision for this trial is still sitting in TrialRunner._cached_trial_decisions).
  3. TrialExecutor.stop_trial() clears all futures that were to be handled by the trial, including the SAVING_RESULT event whose handler (TrialRunner._process_trial_save) is what clears trial.saving_to and pops the entry from TrialRunner._cached_trial_decisions.
  4. As a result, trial.saving_to is never cleared and trial.is_saving remains True.
  5. When the trial starts again (resuming from checkpoint), another training result event comes in via on_pg_ready:
    • train → process_trial_result → enters the trial.is_saving code path, which only adds the decision to the cache (without a SAVING_RESULT to move it to the decision queue) → the trial hangs forever (a toy model of this is sketched after this list).
    • TrialRunner._post_process_on_training_saving_result does nothing, since it checks that the trial is not already in TrialRunner._cached_trial_decisions.
      • No actions are ever executed.
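
The hang in step 5 can be modeled without any Ray internals. The following toy classes are not Ray Tune code; they only mirror the attribute names referenced above to show why a stuck trial.is_saving means no decision ever reaches the execution queue:

```python
class ToyTrial:
    def __init__(self):
        self.saving_to = None

    @property
    def is_saving(self):
        return self.saving_to is not None


class ToyRunner:
    def __init__(self):
        self.cached_decisions = {}  # decisions parked until the save completes
        self.decision_queue = []    # decisions that will actually be executed

    def process_training_result(self, trial, decision):
        if trial.is_saving:
            # Park the decision and wait for a SAVING_RESULT event...
            self.cached_decisions[id(trial)] = decision
        else:
            self.decision_queue.append(decision)

    def process_trial_save(self, trial):
        # This is the event that gets dropped by stop_trial(): without it, the
        # cached decision is never released and saving_to is never reset.
        trial.saving_to = None
        self.decision_queue.append(self.cached_decisions.pop(id(trial)))


runner, trial = ToyRunner(), ToyTrial()
trial.saving_to = "checkpoint_000002"   # the trial was saving when it got paused
# ... stop_trial() dropped the SAVING_RESULT future here, so
# process_trial_save() is never called for this trial ...
runner.process_training_result(trial, "CONTINUE")
assert runner.decision_queue == []      # nothing will ever be executed
```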

Fix in the PR

  • The main culprits are the inconsistent Trial.saving_to/Trial.is_saving state and the stale entry in TrialRunner._cached_trial_decisions. Both are now reset for the trial upon pausing (sketched below).
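
Conceptually, the reset amounts to something like the following. This is a hypothetical helper rather than the actual diff; the function name is made up, and keying _cached_trial_decisions by trial.trial_id is an assumption:

```python
def _reset_saving_state_on_pause(trial_runner, trial):
    # Drop the decision that was parked while the trial was saving.
    trial_runner._cached_trial_decisions.pop(trial.trial_id, None)
    # Clear the in-flight checkpoint so trial.is_saving no longer reports True.
    trial.saving_to = None
```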

Testing

  • This PR includes a test that reproduces this failure mode on current master and passes with the fix. The test artificially creates the scenario by making one trial's checkpointing take a long time (5 s) while PBT tries to pause that trial in order to exploit the other one (see the sketch after this item).
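
The idea behind the test can be sketched as follows. This is illustrative rather than the PR's test code; the class name and the slow_save config flag are made up:

```python
import time

from ray import tune


class SlowCheckpointTrainable(tune.Trainable):  # hypothetical, not the PR's test class
    def setup(self, config):
        self.slow_save = config.get("slow_save", False)

    def step(self):
        return {"mean_accuracy": 1.0}

    def save_checkpoint(self, checkpoint_dir):
        if self.slow_save:
            # Hold the save long enough for synchronous PBT to try pausing this
            # trial while it is still saving.
            time.sleep(5)
        return {"iteration": self.iteration}

    def load_checkpoint(self, checkpoint):
        pass
```

Running this with something like config={"slow_save": tune.grid_search([True, False])} would make exactly one of the two trials slow to checkpoint, which is the shape of the race described above.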

Future TODOs

  • PBT directly calling trial_runner.pause_trial(trial) is not ideal to begin with, and it is the root cause of this issue.
  • Refactor this in the future to clearly separate responsibilities between the scheduler and the trial runner/executor.
  • Make sure that experiment restore works when PBT pauses trials while other trials are checkpointing.

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

justinvyu · Sep 14 '22 18:09