[Tune] [PBT] Add a flag to force an in-memory checkpoint to be used via `Trial.on_checkpoint`

Open justinvyu opened this issue 3 years ago • 0 comments

Why are these changes needed?

PBT needs this change to run correctly when both persistent checkpoints and in-memory checkpoints are being saved.

Failure example with 2 trials:

Trial A has an in-memory checkpoint and a persistent checkpoint.
Trial A is in the `lower_quantile` and Trial B is in the `upper_quantile`, so Trial A exploits Trial B by setting B's in-memory checkpoint as its own using `trial.on_checkpoint(new_trial.last_checkpoint)`.
When Trial A resumes and restores from its `Trial.newest_checkpoint`, the current logic picks either the in-memory checkpoint or the persistent checkpoint, whichever has the larger `checkpoint_id`. If Trial B's in-memory checkpoint is older than Trial A's persistent checkpoint, Trial A's own persistent checkpoint gets loaded instead, so Trial A restores the wrong state (for example, the wrong model parameters).
Trial A ends up with Trial B's hyperparameter config but not Trial B's model weights.
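
The failure can be reproduced with a toy model of the selection logic. This is a minimal sketch, not Ray's actual implementation: `ToyCheckpoint`, `ToyTrial`, and the `owner` field are hypothetical names introduced for illustration, and the checkpoint ids are made up. Only the "pick the larger `checkpoint_id`" rule comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a Tune checkpoint (not the real class)."""
    checkpoint_id: int  # larger id means more recent
    storage: str        # "memory" or "persistent"
    owner: str          # which trial's weights this checkpoint holds


class ToyTrial:
    """Hypothetical trial that tracks one checkpoint per storage mode."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Optional[ToyCheckpoint] = None
        self.persistent: Optional[ToyCheckpoint] = None

    def on_checkpoint(self, checkpoint: ToyCheckpoint) -> None:
        # Store the checkpoint in the slot matching its storage mode.
        if checkpoint.storage == "memory":
            self.memory = checkpoint
        else:
            self.persistent = checkpoint

    @property
    def newest_checkpoint(self) -> Optional[ToyCheckpoint]:
        # Rule from the description: the larger checkpoint_id wins.
        candidates = [c for c in (self.memory, self.persistent) if c is not None]
        return max(candidates, key=lambda c: c.checkpoint_id, default=None)


trial_a = ToyTrial("A")
trial_a.on_checkpoint(ToyCheckpoint(5, "memory", owner="A"))
trial_a.on_checkpoint(ToyCheckpoint(10, "persistent", owner="A"))

# Trial B's in-memory checkpoint is older than A's persistent checkpoint.
b_checkpoint = ToyCheckpoint(8, "memory", owner="B")

# Exploit step: Trial A adopts Trial B's in-memory checkpoint.
trial_a.on_checkpoint(b_checkpoint)

restored = trial_a.newest_checkpoint
print(restored.owner)  # prints "A": Trial A restores its own weights, not B's
```

In this sketch, B's checkpoint (id 8) loses to A's persistent checkpoint (id 10), which is the mismatch between config and weights described above.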

This PR allows PBT to force the trial to restore from a different trial's in-memory checkpoint.
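
One way to picture the forced restore is the toy sketch below. It is an illustration under stated assumptions, not the code in this PR: the `force` keyword, the `_forced` attribute, and the `ToyCheckpoint` / `ToyTrial` classes are all hypothetical, and clearing the override on the next non-forced call is a placeholder policy chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a Tune checkpoint (not the real class)."""
    checkpoint_id: int  # larger id means more recent
    storage: str        # "memory" or "persistent"
    owner: str          # which trial's weights this checkpoint holds


class ToyTrial:
    """Hypothetical trial with an optional forced-restore override."""

    def __init__(self) -> None:
        self.memory: Optional[ToyCheckpoint] = None
        self.persistent: Optional[ToyCheckpoint] = None
        self._forced: Optional[ToyCheckpoint] = None

    def on_checkpoint(self, checkpoint: ToyCheckpoint, force: bool = False) -> None:
        if checkpoint.storage == "memory":
            self.memory = checkpoint
        else:
            self.persistent = checkpoint
        # Assumption: a forced checkpoint overrides id-based selection until
        # the next non-forced checkpoint arrives.
        self._forced = checkpoint if force else None

    @property
    def newest_checkpoint(self) -> Optional[ToyCheckpoint]:
        if self._forced is not None:
            return self._forced
        candidates = [c for c in (self.memory, self.persistent) if c is not None]
        return max(candidates, key=lambda c: c.checkpoint_id, default=None)


def exploit(force: bool) -> str:
    """Run the two-trial scenario and return whose weights Trial A restores."""
    trial_a = ToyTrial()
    trial_a.on_checkpoint(ToyCheckpoint(5, "memory", owner="A"))
    trial_a.on_checkpoint(ToyCheckpoint(10, "persistent", owner="A"))
    # Trial B's in-memory checkpoint is older than A's persistent checkpoint.
    trial_a.on_checkpoint(ToyCheckpoint(8, "memory", owner="B"), force=force)
    return trial_a.newest_checkpoint.owner


print(exploit(force=False))  # prints "A": the stale-id problem
print(exploit(force=True))   # prints "B": the exploited checkpoint is restored
```

How long a forced checkpoint should stay in effect is not specified in the description above, so the clearing rule here should not be read as the PR's behavior.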

Related issue number

N/A

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(

Sep 14 '22 17:09 justinvyu