ray
[Tune] [PBT] Maintain consistent `Trial`/`TrialRunner` state when pausing and resuming trial
Why are these changes needed?
The problem
- When running synchronous PBT while checkpointing every time a perturbation happens, the experiment can reach a state where trial A is `RUNNING` but hangs forever without ever performing another train step, and trial B is `PAUSED`, waiting for A to reach the specified `perturbation_interval`.
Why does this happen?
- Synchronous PBT waits for the last trial to report before exploiting for all trials.
- PBT can call `TrialExecutor.stop_trial(trial)` within `PBT._exploit` before one of the other trials has finished saving (`trial.is_saving` is still True, and there is a decision in `TrialRunner._cached_trial_decisions` associated with this trial).
- `TrialExecutor.stop_trial()` clears all the futures that were to be handled for the trial (including `TrialRunner._process_trial_save` on the SAVING_RESULT event, which is the event that clears `trial.saving_to` and pops from `TrialRunner._cached_trial_decisions`).
  - As a result, `trial.saving_to` is never cleared, and `trial.is_saving` remains True.
- Another training result event comes in due to `on_pg_ready` when the trial starts again (resuming from checkpoint).
  - train → process_trial_result → goes into the `trial.is_saving` code path, which only adds the decision to the cache (without a SAVING_RESULT to move it to the decision queue) → the trial hangs forever.
  - `TrialRunner._post_process_on_training_saving_result` does nothing, since it checks that the trial is not in `TrialRunner._cached_trial_decisions`.
  - No actions will ever be executed.
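The sequence above can be sketched with a toy model. This is only an illustration of the state inconsistency, not Ray's implementation: `ToyTrial`, `ToyRunner`, and all of their methods are hypothetical stand-ins for `Trial`, `TrialRunner`, and the executor, and the checkpoint path is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a trial; not Ray's Trial class."""
    name: str
    saving_to: Optional[str] = None  # set while a checkpoint save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


@dataclass
class ToyRunner:
    """Hypothetical stand-in for the runner/executor pair."""
    pending_events: List[Tuple[str, str]] = field(default_factory=list)
    cached_decisions: Dict[str, str] = field(default_factory=dict)
    executed: List[Tuple[str, str]] = field(default_factory=list)

    def start_save(self, trial: ToyTrial) -> None:
        trial.saving_to = f"/tmp/{trial.name}/checkpoint"  # made-up path
        self.pending_events.append(("SAVING_RESULT", trial.name))

    def on_training_result(self, trial: ToyTrial, decision: str) -> None:
        if trial.is_saving:
            # The decision is parked until the save completes.
            self.cached_decisions[trial.name] = decision
        else:
            self.executed.append((trial.name, decision))

    def stop_trial(self, trial: ToyTrial) -> None:
        # Drops every pending event for the trial, including the
        # SAVING_RESULT that would have cleared `saving_to`.
        self.pending_events = [
            event for event in self.pending_events if event[1] != trial.name
        ]


def simulate_hang() -> Tuple[ToyTrial, ToyRunner]:
    runner = ToyRunner()
    trial = ToyTrial("A")
    runner.start_save(trial)
    runner.on_training_result(trial, "CONTINUE")  # cached, waiting for the save
    runner.stop_trial(trial)  # exploit step discards the save event
    # The trial resumes from a checkpoint and reports another result.
    runner.on_training_result(trial, "CONTINUE")
    return trial, runner
```

After `simulate_hang()`, the toy trial still reports `is_saving`, its decision sits in the cache, and no event remains that could release it, which mirrors the hang described above.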
Fix in the PR
- The main culprits here are the inconsistent `Trial.saving_to`/`Trial.is_saving` and `TrialRunner._cached_trial_decisions`. These are now reset for the trial upon pausing.
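A minimal sketch of the idea behind the fix, under the assumption that resetting both pieces of state together on pause is enough to keep them consistent. The classes and the `pause_trial` method here are hypothetical toy stand-ins, not the code changed in this PR.

```python
from typing import Dict, List, Optional, Tuple


class ToyTrial:
    """Hypothetical stand-in for a trial; not Ray's Trial class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.saving_to: Optional[str] = None

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


class ToyRunner:
    """Hypothetical stand-in for the trial runner."""

    def __init__(self) -> None:
        self.cached_decisions: Dict[str, str] = {}
        self.executed: List[Tuple[str, str]] = []

    def on_training_result(self, trial: ToyTrial, decision: str) -> None:
        if trial.is_saving:
            # Parked until a save-completion event arrives.
            self.cached_decisions[trial.name] = decision
        else:
            self.executed.append((trial.name, decision))

    def pause_trial(self, trial: ToyTrial) -> None:
        # Assumed reset: drop the stale save marker and the cached
        # decision together so neither outlives the discarded save.
        trial.saving_to = None
        self.cached_decisions.pop(trial.name, None)


def pause_and_resume() -> Tuple[ToyTrial, ToyRunner]:
    runner = ToyRunner()
    trial = ToyTrial("A")
    trial.saving_to = "/tmp/A/checkpoint"  # made-up path; save in flight
    runner.on_training_result(trial, "CONTINUE")  # cached while saving
    runner.pause_trial(trial)  # state reset on pause
    runner.on_training_result(trial, "CONTINUE")  # result after resuming
    return trial, runner
```

In this toy version, the result reported after resuming is executed immediately because the trial no longer looks like it is saving.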
Testing
- This PR includes a test that reproduces this failure mode on the current `master` and passes with the fix in this PR. The test artificially creates the scenario by having one trial's checkpointing take a long time (5s), while PBT tries to pause that trial to exploit the other one.
Future TODOs
- `PBT` directly calling `trial_runner.pause_trial(trial)` is not ideal to begin with, and it is the root cause of this issue.
  - Refactor this in the future to clearly separate responsibilities between the scheduler and the trial runner/executor.
- Make sure that experiment restore works when PBT pauses trials while other trials are checkpointing.
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(