
Release test tune_cloud_durable_upload_rllib_trainer.aws failed


Release test tune_cloud_durable_upload_rllib_trainer.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/1770#018899d9-3303-4bb2-b374-1842c1172e10 for more details. cc @ml

 -- created by ray-test-bot

can-anyscale, Jun 08 '23 08:06

FYI, this is an unstable test, so I don't know if you want to ignore it. Ignoring it for too long will eventually jail the test, though. Please see https://www.notion.so/anyscale-hq/OSS-Test-Policy-47d2f1ebae59407eae09a75380f6282b to understand the different test states. Thanks

can-anyscale, Jun 08 '23 15:06

Looks like this is caused by an incompatible s3fs version. I also ran into this when using Tune w/ cloud storage, and the fix was to manually upgrade to s3fs==2023.5.0.
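
For reference, a minimal sketch (not from the original report) of one way to apply that pin in a quick local reproduction, assuming a runtime_env pip override fits the setup (the release test itself runs on a prebuilt cluster image, so this is purely illustrative):

# Illustrative only: force the fixed s3fs version for a local Ray/Tune run.
# The release test itself bakes dependencies into the cluster image instead.
import ray

ray.init(runtime_env={"pip": ["fsspec", "s3fs==2023.5.0"]})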

Things to investigate:

  • Why doesn't this also happen in the other tune cloud tests? See tune_cloud_durable_upload.aws, which passed.
  • Why did this only show up now?
(PPO pid=178, ip=10.0.51.87) Last sync still in progress, skipping sync up of /home/ray/ray_results/cloud_durable_upload/PPO_CartPole-v1_d5bd0_00001_1_id=1_2023-06-08_01-02-42/ to s3://tune-cloud-tests/durable_upload_rllib_str/test_1686211340/cloud_durable_upload/PPO_CartPole-v1_d5bd0_00001_1_id=1_2023-06-08_01-02-42
(PPO pid=178, ip=10.0.51.87) Caught sync error: Sync process failed: The specified bucket does not exist. Retrying after sleeping for 1.0 seconds... [repeated 8x across cluster]
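
As a hedged aside (not part of the original thread): since the bucket in the sync path presumably does exist, one quick way to separate a genuinely missing bucket from a client-side s3fs incompatibility is to query the bucket with s3fs directly:

# Illustrative check, run with the same AWS credentials as the failing job.
import s3fs

fs = s3fs.S3FileSystem()
print(fs.exists("tune-cloud-tests"))  # bucket name taken from the sync path above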

Regarding the instability of this test: I don't think this should be labeled as unstable anymore. The corresponding tune_cloud_durable_upload.aws test is not marked unstable.

justinvyu, Jun 08 '23 20:06

By the way @can-anyscale, why doesn't this failure show up on the go/ossci dashboard?

Why did this only show up now?

Ohh, it's because it's unstable. This test has actually been failing for a long time -- probably since this PR: https://github.com/ray-project/ray/pull/34663

justinvyu, Jun 08 '23 20:06

Removing the unstable mark after fixing the test sounds good!

can-anyscale, Jun 08 '23 20:06

Why doesn't this also happen in the other tune cloud tests? See tune_cloud_durable_upload.aws that passed.

This is because these release tests use different base images. tune_cloud_durable_upload.aws uses the base ray image, while the failing tests use the ray-ml image. The ray-ml image is the one that pins the s3fs version, which causes this "bucket does not exist" error.
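
As an aside, a rough way to confirm the image difference (illustrative, not part of the original investigation) is to dump the storage-related package versions inside each container and diff them:

# Run inside both the ray and ray-ml containers and compare the output.
# Note: importlib.metadata needs Python 3.8+; on the py37 image use the
# importlib_metadata backport instead.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("s3fs", "fsspec", "pyarrow", "boto3", "botocore"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")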

justinvyu, Jun 09 '23 18:06

The release tests passed on the initial PR: https://github.com/ray-project/ray/pull/34663

This is because:

  1. Release tests don't build new docker images. They just use the nightly image and install a newly built Ray wheel on top. This wheel comes from the PR (when running release tests through the release tests PR pipeline).
  2. The PR added s3fs as a dependency in ml/requirements_tune.txt.
  3. The nightly image filled in by env["RAY_IMAGE_ML_NIGHTLY_GPU"] is "anyscale/ray-ml:nightly-py37-gpu". At that time, the nightly image did not have s3fs.
  4. The release test ran without s3fs, and you can see in the job logs that s3fs was not used because it wasn't installed:
2023-04-26 03:04:12,066 WARNING syncer.py:223 -- You are using S3 for remote storage, but you don't have `s3fs` installed. Due to a bug in PyArrow, this can lead to significant slowdowns. To avoid this, install s3fs with `pip install fsspec s3fs`.
  5. The test passed because it didn't go through the s3fs path and instead fell back to pyarrow for syncing with s3 (see the sketch after this list).
  6. The next day, the test probably started failing, because s3fs was now included in the nightly image and the new codepath was exercised.
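
The dispatch implied by that warning looks roughly like the sketch below (illustrative only, not Ray's actual syncer code): with s3fs installed, the fsspec-backed filesystem is used; without it, syncing falls back to pyarrow's native S3 filesystem.

# Illustrative sketch of the fallback behavior described above; not Ray's real code.
import pyarrow.fs


def get_s3_filesystem() -> pyarrow.fs.FileSystem:
    try:
        import s3fs
        from pyarrow.fs import FSSpecHandler, PyFileSystem

        # s3fs is installed: wrap the fsspec filesystem so pyarrow can use it.
        return PyFileSystem(FSSpecHandler(s3fs.S3FileSystem()))
    except ImportError:
        # Without s3fs, the test silently took this pyarrow-only path and passed.
        return pyarrow.fs.S3FileSystem()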

Problems:

  • Mismatch between the dependencies baked into the nightly docker image and the new dependencies that developers want to add. Any ideas on how to fix this @can-anyscale?
  • The unstable release test caused this failure to stay hidden on the ossci dashboard for a long time. (As oncall, I don't monitor buildkite at all; I just monitor the preset dashboard.)
    • TODO: Go through the list of our release tests and see which ones should actually be marked as unstable.
    • TODO: Add a view on the preset dashboard that specifically keeps track of unstable tests. They shouldn't affect the pass % metrics, but they should still be visible somewhere.

justinvyu, Jun 09 '23 22:06

Sidenote: see this related issue on s3fs: https://github.com/fsspec/s3fs/issues/738

justinvyu, Jun 09 '23 22:06

Problem with upgrading s3fs:

  • The latest s3fs (2023.6.0) depends on aiobotocore 2.5, which requires botocore<1.29.77,>=1.29.76.
  • We pin boto3 elsewhere to 1.26.82, which requires botocore<1.30.0,>=1.29.82 -- so the two botocore constraints conflict (see the check after this list).
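
A quick illustrative check (not from the original comment) that these two botocore requirements have an empty intersection, using the packaging library:

# Both specifier sets are taken from the dependency pins listed above.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

via_s3fs = SpecifierSet(">=1.29.76,<1.29.77")   # s3fs 2023.6.0 -> aiobotocore 2.5
via_boto3 = SpecifierSet(">=1.29.82,<1.30.0")   # our pinned boto3==1.26.82

combined = via_s3fs & via_boto3
candidates = [Version(f"1.29.{patch}") for patch in range(70, 100)]
print([str(v) for v in candidates if v in combined])  # [] -> no botocore satisfies both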

justinvyu, Jun 09 '23 23:06

Test has been failing for far too long. Jailing.

can-anyscale, Jun 13 '23 09:06

Test passed on latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1827#0188f67a-34ee-438c-be7b-9d62f4aa7647

can-anyscale, Jun 26 '23 07:06