
[Tune] [PBT] Maintain consistent `Trial`/`TrialRunner` state when pausing and resuming trial

Open · justinvyu opened this pull request 3 years ago · 0 comments

Why are these changes needed?

The problem

  • When running synchronous PBT with a checkpoint taken every time a perturbation happens, the experiment can reach a state where trial A is RUNNING but hangs forever without ever performing another training step, while trial B is PAUSED, waiting for A to reach the specified perturbation_interval. (A minimal sketch of this kind of setup is shown below.)
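
For context, a setup along these lines can trigger the hang. This is an illustrative sketch only, not the reproduction script from the PR: the trainable, metric, and hyperparameter names are placeholders, and the intervals are arbitrary.

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class MyTrainable(tune.Trainable):  # placeholder trainable
    def setup(self, config):
        self.lr = config["lr"]

    def step(self):
        # Report a dummy metric; Tune adds training_iteration to the result itself.
        return {"mean_accuracy": self.lr * self.iteration}

    def save_checkpoint(self, checkpoint_dir):
        return {"iteration": self.iteration}

    def load_checkpoint(self, checkpoint):
        pass


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,
    synch=True,  # synchronous PBT: trials wait for each other at each boundary
    hyperparam_mutations={"lr": tune.uniform(0.01, 0.1)},
)

tune.run(
    MyTrainable,
    num_samples=2,
    metric="mean_accuracy",
    mode="max",
    scheduler=pbt,
    checkpoint_freq=2,  # checkpoint at every perturbation boundary
    stop={"training_iteration": 10},
    config={"lr": tune.uniform(0.01, 0.1)},
)
```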

Why does this happen?

  1. Synchronous PBT waits for the last trial's result to come in before performing the exploit step for all trials.
  2. PBT can call TrialExecutor.stop_trial(trial) within PBT._exploit before one of the other trials has finished saving (trial.is_saving is still True, and a decision for this trial is still sitting in TrialRunner._cached_trial_decisions).
  3. TrialExecutor.stop_trial() clears all futures that were to be handled by the trial, including the SAVING_RESULT event whose handler (TrialRunner._process_trial_save) is what clears trial.saving_to and pops the entry from TrialRunner._cached_trial_decisions.
  4. As a result, trial.saving_to is never cleared and trial.is_saving remains True.
  5. When the trial starts again (resuming from checkpoint), another training result event comes in via on_pg_ready:
    • train → process_trial_result → enters the trial.is_saving code path, which only adds the decision to the cache (without a SAVING_RESULT to move it to the decision queue) → the trial hangs forever (a toy model of this is sketched after this list).
    • TrialRunner._post_process_on_training_saving_result does nothing, since it checks that the trial is not already in TrialRunner._cached_trial_decisions.
      • No actions are ever executed.
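
The hang in step 5 can be modeled without any Ray internals. The following toy classes are not Ray Tune code; they only mirror the attribute names referenced above to show why a stuck trial.is_saving means no decision ever reaches the execution queue:

```python
class ToyTrial:
    def __init__(self):
        self.saving_to = None

    @property
    def is_saving(self):
        return self.saving_to is not None


class ToyRunner:
    def __init__(self):
        self.cached_decisions = {}  # decisions parked until the save completes
        self.decision_queue = []    # decisions that will actually be executed

    def process_training_result(self, trial, decision):
        if trial.is_saving:
            # Park the decision and wait for a SAVING_RESULT event...
            self.cached_decisions[id(trial)] = decision
        else:
            self.decision_queue.append(decision)

    def process_trial_save(self, trial):
        # This is the event that gets dropped by stop_trial(): without it, the
        # cached decision is never released and saving_to is never reset.
        trial.saving_to = None
        self.decision_queue.append(self.cached_decisions.pop(id(trial)))


runner, trial = ToyRunner(), ToyTrial()
trial.saving_to = "checkpoint_000002"   # the trial was saving when it got paused
# ... stop_trial() dropped the SAVING_RESULT future here, so
# process_trial_save() is never called for this trial ...
runner.process_training_result(trial, "CONTINUE")
assert runner.decision_queue == []      # nothing will ever be executed
```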

Fix in the PR

  • The main culprits are the inconsistent Trial.saving_to/Trial.is_saving state and the stale entry in TrialRunner._cached_trial_decisions. Both are now reset for the trial upon pausing (sketched below).
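
Conceptually, the reset amounts to something like the following. This is a hypothetical helper rather than the actual diff; the function name is made up, and keying _cached_trial_decisions by trial.trial_id is an assumption:

```python
def _reset_saving_state_on_pause(trial_runner, trial):
    # Drop the decision that was parked while the trial was saving.
    trial_runner._cached_trial_decisions.pop(trial.trial_id, None)
    # Clear the in-flight checkpoint so trial.is_saving no longer reports True.
    trial.saving_to = None
```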

Testing

  • This PR includes a test that reproduces this failure mode on current master and passes with the fix. The test artificially creates the scenario by making one trial's checkpointing take a long time (5 s) while PBT tries to pause that trial in order to exploit the other one (see the sketch after this item).
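
The idea behind the test can be sketched as follows. This is illustrative rather than the PR's test code; the class name and the slow_save config flag are made up:

```python
import time

from ray import tune


class SlowCheckpointTrainable(tune.Trainable):  # hypothetical, not the PR's test class
    def setup(self, config):
        self.slow_save = config.get("slow_save", False)

    def step(self):
        return {"mean_accuracy": 1.0}

    def save_checkpoint(self, checkpoint_dir):
        if self.slow_save:
            # Hold the save long enough for synchronous PBT to try pausing this
            # trial while it is still saving.
            time.sleep(5)
        return {"iteration": self.iteration}

    def load_checkpoint(self, checkpoint):
        pass
```

Running this with something like config={"slow_save": tune.grid_search([True, False])} would make exactly one of the two trials slow to checkpoint, which is the shape of the race described above.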

Future TODOs

  • PBT directly calling trial_runner.pause_trial(trial) is not ideal to begin with, and it is the root cause of this issue.
  • Refactor this in the future to clearly separate responsibilities between the scheduler and the trial runner/executor.
  • Make sure that experiment restore works when PBT pauses trials while other trials are checkpointing.

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

justinvyu · Sep 14 '22 18:09