[tune] Improve excessive syncing warning and make some deprecations

Open justinvyu opened this issue 9 months ago • 1 comment

Why are these changes needed?

This PR makes it so that the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing and does not always trigger at the end of training.

  • Previously, the wording made the warning read like an error, which is confusing when it is printed alongside an actual checkpointing error.
  • The warning also often fired at the end of training, because a forced driver sync always runs at the end of a run, possibly right after a previous sync. For example, a simple script like this prints the warning:
from ray import tune

kwargs = {"kwarg1": 1, "kwarg2": 2}


def train_fn(config, **trainable_kwargs):
    print(config, trainable_kwargs)


tune.Tuner(tune.with_parameters(train_fn, **kwargs)).fit()
Trial train_fn_a9462_00000 completed after 0 iterations at 2024-05-08 15:38:18. Total running time: 1s
2024-05-08 15:38:18,757 WARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).
2024-05-08 15:38:18,758 INFO tune.py:1007 -- Wrote the latest version of all result files and experiment state to '/Users/justin/ray_results/train_fn_2024-05-08_15-38-13' in 0.0031s.
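
A minimal sketch of the intended behavior, assuming the warning is driven by the time elapsed between consecutive snapshots: warn only when two snapshots land within the threshold, and never for the forced sync at the end of a run. The class `SnapshotWarningThrottle` and its `final` flag are hypothetical names used for illustration; this is not the code in `experiment_state.py`.

```python
import time
from typing import Optional


class SnapshotWarningThrottle:
    """Hypothetical helper, not Ray Tune's actual implementation."""

    def __init__(self, threshold_s: float = 5.0):
        # 5.0 matches the threshold shown in the log output above.
        self.threshold_s = threshold_s
        self._last_snapshot_time: Optional[float] = None

    def record_snapshot(self, now: Optional[float] = None, final: bool = False) -> bool:
        """Record a snapshot; return True if a warning should be logged."""
        now = time.monotonic() if now is None else now
        last = self._last_snapshot_time
        self._last_snapshot_time = now
        if final:
            # The forced sync at the end of a run is expected, so it never warns.
            return False
        # Warn only if the previous snapshot happened within the threshold.
        return last is not None and (now - last) < self.threshold_s


throttle = SnapshotWarningThrottle(threshold_s=5.0)
throttle.record_snapshot(now=0.0)                     # first snapshot
warn_mid_run = throttle.record_snapshot(now=1.0)      # second snapshot, 1s later
warn_at_end = throttle.record_snapshot(now=1.5, final=True)  # forced end-of-run sync
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would rely on the monotonic clock default.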

This PR also deprecates the TUNE_RESULT_DIR and RAY_AIR_LOCAL_CACHE_DIR environment variables and the local_dir argument.
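
One way such deprecations could be surfaced is sketched below, assuming a check that runs before an experiment starts. The helper `find_deprecated_settings` and the warning text are assumptions made for illustration; only the three deprecated names come from this PR.

```python
import os
import warnings
from typing import List, Mapping, Optional

# Environment variables named as deprecated in this PR.
DEPRECATED_ENV_VARS = ("TUNE_RESULT_DIR", "RAY_AIR_LOCAL_CACHE_DIR")


def find_deprecated_settings(
    local_dir: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Hypothetical check: return the deprecated settings in use, warning for each."""
    environ = os.environ if environ is None else environ
    found = [name for name in DEPRECATED_ENV_VARS if name in environ]
    if local_dir is not None:
        found.append("local_dir")
    for name in found:
        # The message wording is illustrative, not the text Ray emits.
        warnings.warn(f"`{name}` is deprecated.", DeprecationWarning, stacklevel=2)
    return found
```

Taking `environ` as a parameter lets the check be exercised without mutating the process environment.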

Related issue number

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

justinvyu avatar May 08 '24 22:05 justinvyu

w00t, I just rebased your PR on top of the latest master to avoid a bug in microcheck. Hope you don't mind, thanks!

can-anyscale avatar May 17 '24 03:05 can-anyscale