[Tune] [PBT] Add a flag to force an in-memory checkpoint to be used via `Trial.on_checkpoint`

Open justinvyu opened this issue 3 years ago • 0 comments

Why are these changes needed?

This change is needed for the PBT algorithm to run correctly in the case where persistent checkpoints and in-memory checkpoints are both being saved.

Failure example with 2 trials:

  • Trial A has an in-memory checkpoint and a persistent checkpoint.
  • Trial A is in the lower_quantile and Trial B is in the upper_quantile, so Trial A exploits Trial B by adopting B's in-memory checkpoint as its own via trial.on_checkpoint(new_trial.last_checkpoint).
  • When Trial A resumes and restores from its Trial.newest_checkpoint, the current logic picks whichever of the in-memory and persistent checkpoints has the larger checkpoint_id. If Trial B's in-memory checkpoint is staler (has a lower checkpoint_id) than Trial A's persistent checkpoint, Trial A's own persistent checkpoint gets loaded instead, and it contains the wrong state (for example, Trial A's model parameters rather than Trial B's).
  • Trial A ends up with Trial B's hyperparam config but without the same weights as Trial B's model.

This PR allows PBT to force the trial to restore from a different trial's in-memory checkpoint.
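The failure and the fix can be illustrated with a toy model. This is a minimal sketch and not Ray's implementation: `Checkpoint`, `ToyTrial`, the `force_in_memory` argument, and `restored_owner` are hypothetical names made up for the illustration. Only the idea of selecting by max checkpoint_id, and of overriding that selection with a flag, comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    """Toy stand-in for a Tune checkpoint (hypothetical, not Ray's class)."""
    checkpoint_id: int
    owner: str       # name of the trial whose weights this checkpoint holds
    in_memory: bool


class ToyTrial:
    """Toy trial that tracks one in-memory and one persistent checkpoint."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory_ckpt: Optional[Checkpoint] = None
        self.persistent_ckpt: Optional[Checkpoint] = None
        self.force_in_memory = False  # hypothetical flag

    def on_checkpoint(self, ckpt: Checkpoint, force_in_memory: bool = False) -> None:
        # Store the checkpoint in the matching slot; optionally pin the
        # in-memory checkpoint so the next restore must use it.
        if ckpt.in_memory:
            self.memory_ckpt = ckpt
            self.force_in_memory = force_in_memory
        else:
            self.persistent_ckpt = ckpt

    def newest_checkpoint(self) -> Optional[Checkpoint]:
        # With the flag set, skip the id comparison entirely.
        if self.force_in_memory and self.memory_ckpt is not None:
            return self.memory_ckpt
        # Otherwise mimic the described logic: highest checkpoint_id wins.
        candidates = [c for c in (self.memory_ckpt, self.persistent_ckpt) if c]
        return max(candidates, key=lambda c: c.checkpoint_id, default=None)


def restored_owner(force: bool) -> str:
    """Return whose weights Trial A restores after exploiting Trial B."""
    trial_a = ToyTrial("A")
    # Trial A has already saved a persistent checkpoint with a high id.
    trial_a.on_checkpoint(Checkpoint(checkpoint_id=5, owner="A", in_memory=False))
    # Trial B's in-memory checkpoint is staler (lower id) than A's persistent one.
    b_ckpt = Checkpoint(checkpoint_id=3, owner="B", in_memory=True)
    # Exploit step: A adopts B's in-memory checkpoint.
    trial_a.on_checkpoint(b_ckpt, force_in_memory=force)
    return trial_a.newest_checkpoint().owner


if __name__ == "__main__":
    print("without flag:", restored_owner(force=False))
    print("with flag:   ", restored_owner(force=True))
```

Without the flag, the id comparison selects Trial A's own persistent checkpoint, so Trial A keeps its old weights under Trial B's hyperparameters. With the flag, the in-memory checkpoint taken from Trial B is restored regardless of its id.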

Related issue number

N/A

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

justinvyu, Sep 14 '22 17:09