[Tune] Restore fails for GCS directory upload_dir

Open matthewdeng opened this issue 3 years ago • 0 comments

What happened + What you expected to happen

When I run an experiment with the GCS bucket itself as the upload_dir, I can successfully restore the experiment.

RunConfig(sync_config=SyncConfig(upload_dir=BUCKET))
2022-09-21 21:31:55,694 INFO trial_runner.py:555 -- Trying to find and download experiment checkpoint at gs://<BUCKET_NAME>/apple-sauce
2022-09-21 21:31:55,907 INFO trial_runner.py:591 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state.
2022-09-21 21:31:55,907 INFO trial_runner.py:738 -- Using following checkpoint to resume: /home/ray/ray_results/apple-sauce/experiment_state-2022-09-21_21-31-50.json
2022-09-21 21:31:55,907 WARNING trial_runner.py:749 -- Attempting to resume experiment from /home/ray/ray_results/apple-sauce. This will ignore any new changes to the specification.
2022-09-21 21:31:55,910 INFO tune.py:654 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.

When I set upload_dir to a directory inside the bucket, restoring from that path fails to sync down the experiment checkpoint and logs the following warning, after which Tune starts a new experiment instead of resuming.

RunConfig(sync_config=SyncConfig(upload_dir=EXPERIMENT_DIR))
2022-09-21 21:32:05,076 INFO trial_runner.py:555 -- Trying to find and download experiment checkpoint at gs://<BUCKET_NAME>/experiments/apple-sauce
2022-09-21 21:32:05,185 WARNING trial_runner.py:568 -- Got error when trying to sync down: Sync process failed: GetFileInfo() yielded path '<BUCKET_NAME>/experiments', which is outside base dir '<BUCKET_NAME>/experiments/apple-sauce' 
Please check this error message for potential access problems - if a directory was not found, that is expected at this stage when you're starting a new experiment.
2022-09-21 21:32:05,185 INFO trial_runner.py:575 -- No remote checkpoint was found or an error occurred when trying to download the experiment checkpoint. Please check the previous warning message for more details. Ray Tune will now start a new experiment.
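
For illustration, the warning reads as if the listing of the experiment directory returned its parent (`<BUCKET_NAME>/experiments`), and a containment check against the base dir then rejected it. The sketch below only mimics that kind of check with the two paths from the log. It is not the actual code in Ray or in the underlying filesystem library, and `is_within_base_dir` is a hypothetical helper introduced here.

```python
from pathlib import PurePosixPath


def is_within_base_dir(path: str, base_dir: str) -> bool:
    """Return True if `path` equals `base_dir` or lies underneath it.

    Hypothetical helper for illustration only; not part of Ray's API.
    """
    path_parts = PurePosixPath(path).parts
    base_parts = PurePosixPath(base_dir).parts
    # `path` is inside `base_dir` only if it starts with all of its components.
    return path_parts[: len(base_parts)] == base_parts


# Paths copied from the warning message above.
base_dir = "<BUCKET_NAME>/experiments/apple-sauce"
yielded_path = "<BUCKET_NAME>/experiments"

# The yielded path is the parent of the base dir, so the check fails.
print(is_within_base_dir(yielded_path, base_dir))

# A file underneath the experiment directory passes the same check.
print(is_within_base_dir(f"{base_dir}/experiment_state.json", base_dir))
```

If this reading is right, the bucket-root case would succeed simply because no parent directory entry exists above the experiment directory to be listed, but that is an assumption rather than something the logs confirm.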

Versions / Dependencies

Ray nightly wheel https://github.com/ray-project/ray/commit/fa182d3c9e478ef4c169ccf7459764768996110f

Reproduction script

from ray import tune
from ray.air import session
from ray.air.config import RunConfig
from ray.tune import Tuner

BUCKET = "gs://<BUCKET_NAME>"  # placeholder: substitute your own bucket URI
EXPERIMENT_DIR = f"{BUCKET}/experiments"
EXPERIMENT_NAME = "apple-sauce"

def trainable(config):
    session.report({"metric": 1})


def test_bucket():
    Tuner(
        trainable,
        run_config=RunConfig(
            name=EXPERIMENT_NAME,
            sync_config=tune.SyncConfig(
                upload_dir=BUCKET,
                sync_period=10,
            ),
        ),
    ).fit()
    
    experiment_path = f"{BUCKET}/{EXPERIMENT_NAME}"
    Tuner.restore(experiment_path).fit()

def test_dir():
    Tuner(
        trainable,
        run_config=RunConfig(
            name=EXPERIMENT_NAME,
            sync_config=tune.SyncConfig(
                upload_dir=EXPERIMENT_DIR,
                sync_period=10,
            ),
        ),
    ).fit()
    
    experiment_path = f"{EXPERIMENT_DIR}/{EXPERIMENT_NAME}"
    Tuner.restore(experiment_path).fit()

print("TEST BUCKET")
test_bucket()
print("TEST DIR")
test_dir()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

matthewdeng, Sep 22 '22 04:09