ray
[Tune] [PBT] Add a flag to force an in-memory checkpoint to be used via `Trial.on_checkpoint`
Why are these changes needed?
This change is needed for the PBT algorithm to run correctly in the case where persistent checkpoints and in-memory checkpoints are both being saved.
Failure example with 2 trials:
- Trial A has an in-memory checkpoint and a persistent checkpoint.
- Trial A is in the `lower_quantile` and Trial B is in the `upper_quantile`, so Trial A exploits Trial B by setting B's in-memory checkpoint as its own using `trial.on_checkpoint(new_trial.last_checkpoint)`.
- Upon the trial resuming and restoring from its `Trial.newest_checkpoint`, the current logic takes either the in-memory checkpoint or persistent checkpoint based on max `checkpoint_id`. However, if Trial B's in-memory checkpoint is staler than A's persistent checkpoint, then Trial A's own persistent checkpoint gets loaded, which contains the wrong model parameters to load (for example).
- Trial A ends up with Trial B's hyperparam config but without the same weights as Trial B's model.
This PR allows PBT to force the trial to restore from a different trial's in-memory checkpoint.
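
The failure mode and the intended fix can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not Ray Tune's actual implementation: `Checkpoint`, `ToyTrial`, the `force` argument, the `force_memory` attribute, and `restore_owner` are all hypothetical names invented for illustration. The only behavior taken from the description above is that restore picks the checkpoint with the largest `checkpoint_id` unless the in-memory checkpoint is forced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    # Hypothetical stand-in for a checkpoint record (not a Ray class).
    storage: str        # "memory" or "persistent"
    checkpoint_id: int  # iteration at which the checkpoint was taken
    owner: str          # trial whose model weights the checkpoint holds


class ToyTrial:
    """Toy model of a trial's checkpoint bookkeeping (assumption, not Ray code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Optional[Checkpoint] = None
        self.persistent: Optional[Checkpoint] = None
        self.force_memory = False  # hypothetical flag

    def on_checkpoint(self, checkpoint: Checkpoint, force: bool = False) -> None:
        # Record the checkpoint; `force` only applies to in-memory checkpoints.
        if checkpoint.storage == "memory":
            self.memory = checkpoint
            self.force_memory = force
        else:
            self.persistent = checkpoint

    def newest_checkpoint(self) -> Checkpoint:
        # With the flag set, always restore from the in-memory checkpoint.
        if self.force_memory and self.memory is not None:
            return self.memory
        # Otherwise fall back to the max-checkpoint_id rule described above.
        candidates = [c for c in (self.memory, self.persistent) if c is not None]
        return max(candidates, key=lambda c: c.checkpoint_id)


def restore_owner(force: bool) -> str:
    """Return whose weights Trial A would restore after exploiting Trial B."""
    trial_a = ToyTrial("A")
    # Trial A has its own in-memory and persistent checkpoints at iteration 10.
    trial_a.on_checkpoint(Checkpoint("memory", 10, "A"))
    trial_a.on_checkpoint(Checkpoint("persistent", 10, "A"))
    # Trial B's in-memory checkpoint is staler (iteration 8).
    b_checkpoint = Checkpoint("memory", 8, "B")
    # Trial A exploits Trial B by adopting B's in-memory checkpoint.
    trial_a.on_checkpoint(b_checkpoint, force=force)
    return trial_a.newest_checkpoint().owner


if __name__ == "__main__":
    print("without force:", restore_owner(force=False))
    print("with force:", restore_owner(force=True))
```

In this toy setup, calling `restore_owner(force=False)` reproduces the mismatch (Trial A restores its own persistent checkpoint), while `restore_owner(force=True)` restores Trial B's weights, which is the behavior PBT expects after an exploit step.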
Related issue number
N/A
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(