[Tune] Restore fails for GCS directory upload_dir
What happened + What you expected to happen
When I run an experiment with the GCS Bucket as the upload_dir, I can successfully restore the experiment.
RunConfig(sync_config=SyncConfig(upload_dir=BUCKET))
2022-09-21 21:31:55,694 INFO trial_runner.py:555 -- Trying to find and download experiment checkpoint at gs://<BUCKET_NAME>/apple-sauce
2022-09-21 21:31:55,907 INFO trial_runner.py:591 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state.
2022-09-21 21:31:55,907 INFO trial_runner.py:738 -- Using following checkpoint to resume: /home/ray/ray_results/apple-sauce/experiment_state-2022-09-21_21-31-50.json
2022-09-21 21:31:55,907 WARNING trial_runner.py:749 -- Attempting to resume experiment from /home/ray/ray_results/apple-sauce. This will ignore any new changes to the specification.
2022-09-21 21:31:55,910 INFO tune.py:654 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
When I set upload_dir to a directory inside the bucket, restoring from that path fails with the following error, and Ray Tune starts a new experiment instead of resuming.
RunConfig(sync_config=SyncConfig(upload_dir=EXPERIMENT_DIR))
2022-09-21 21:32:05,076 INFO trial_runner.py:555 -- Trying to find and download experiment checkpoint at gs://<BUCKET_NAME>/experiments/apple-sauce
2022-09-21 21:32:05,185 WARNING trial_runner.py:568 -- Got error when trying to sync down: Sync process failed: GetFileInfo() yielded path '<BUCKET_NAME>/experiments', which is outside base dir '<BUCKET_NAME>/experiments/apple-sauce'
Please check this error message for potential access problems - if a directory was not found, that is expected at this stage when you're starting a new experiment.
2022-09-21 21:32:05,185 INFO trial_runner.py:575 -- No remote checkpoint was found or an error occurred when trying to download the experiment checkpoint. Please check the previous warning message for more details. Ray Tune will now start a new experiment.
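The warning above complains that a listed path falls outside the base directory being synced. As a rough illustration of what such a containment check rejects, here is a minimal sketch; it is not the Arrow or Ray implementation, and the helper name is_within_base_dir and the bucket name my-bucket are hypothetical.

```python
from pathlib import PurePosixPath


def is_within_base_dir(path: str, base_dir: str) -> bool:
    """Return True if `path` equals `base_dir` or lies beneath it.

    Hypothetical helper for illustration only; the real check lives in the
    filesystem layer and may differ in detail.
    """
    p = PurePosixPath(path)
    base = PurePosixPath(base_dir)
    return p == base or base in p.parents


# "my-bucket" stands in for the redacted bucket name.
base_dir = "my-bucket/experiments/apple-sauce"

# A file inside the experiment directory passes the check.
inside = is_within_base_dir(f"{base_dir}/experiment_state.json", base_dir)

# The parent directory, as reported in the warning, does not.
parent = is_within_base_dir("my-bucket/experiments", base_dir)

print(inside, parent)
```

In the failing case the listing returned the parent directory 'experiments' itself, which can never satisfy a check of this kind against '.../experiments/apple-sauce'.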
Versions / Dependencies
Ray nightly wheel https://github.com/ray-project/ray/commit/fa182d3c9e478ef4c169ccf7459764768996110f
Reproduction script
from ray import tune
from ray.air import session
from ray.air.config import RunConfig
from ray.tune import Tuner

# Placeholder: replace <BUCKET_NAME> with your GCS bucket.
# The gs:// scheme is inferred from the URIs in the logs above.
BUCKET = "gs://<BUCKET_NAME>"
EXPERIMENT_DIR = f"{BUCKET}/experiments"
EXPERIMENT_NAME = "apple-sauce"


def trainable(config):
    session.report({"metric": 1})


def test_bucket():
    # upload_dir is the bucket root: restore succeeds.
    Tuner(
        trainable,
        run_config=RunConfig(
            name=EXPERIMENT_NAME,
            sync_config=tune.SyncConfig(
                upload_dir=BUCKET,
                sync_period=10,
            ),
        ),
    ).fit()
    experiment_path = f"{BUCKET}/{EXPERIMENT_NAME}"
    Tuner.restore(experiment_path).fit()


def test_dir():
    # upload_dir is a directory inside the bucket: restore fails to sync down.
    Tuner(
        trainable,
        run_config=RunConfig(
            name=EXPERIMENT_NAME,
            sync_config=tune.SyncConfig(
                upload_dir=EXPERIMENT_DIR,
                sync_period=10,
            ),
        ),
    ).fit()
    experiment_path = f"{EXPERIMENT_DIR}/{EXPERIMENT_NAME}"
    Tuner.restore(experiment_path).fit()


print("TEST BUCKET")
test_bucket()
print("TEST DIR")
test_dir()
Issue Severity
Medium: It is a significant difficulty but I can work around it.