Release test tune_cloud_durable_upload_rllib_trainer.aws failed
Release test tune_cloud_durable_upload_rllib_trainer.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/1770#018899d9-3303-4bb2-b374-1842c1172e10 for more details. cc @ml
-- created by ray-test-bot
FYI, this is an unstable test, so I don't know if you want to ignore it. Ignoring it for too long will eventually jail the test, though. Please see https://www.notion.so/anyscale-hq/OSS-Test-Policy-47d2f1ebae59407eae09a75380f6282b for understanding the different test states. Thanks
Looks like this is caused by an incompatible s3fs version. I also ran into this when using Tune w/ cloud storage, and the fix was to manually upgrade to s3fs==2023.5.0.
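For reference, a minimal sketch of that manual workaround (the s3fs==2023.5.0 pin is from this thread; running it directly in the cluster environment is an assumption):

```bash
# Pin s3fs to the version that worked for Tune cloud storage (per this thread).
pip install "s3fs==2023.5.0"

# Check what actually got resolved alongside it.
pip show s3fs aiobotocore botocore
```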
Things to investigate:
- Why doesn't this also happen in the other tune cloud tests? See tune_cloud_durable_upload.aws, which passed.
- Why did this only show up now?
(PPO pid=178, ip=10.0.51.87) Last sync still in progress, skipping sync up of /home/ray/ray_results/cloud_durable_upload/PPO_CartPole-v1_d5bd0_00001_1_id=1_2023-06-08_01-02-42/ to s3://tune-cloud-tests/durable_upload_rllib_str/test_1686211340/cloud_durable_upload/PPO_CartPole-v1_d5bd0_00001_1_id=1_2023-06-08_01-02-42
(PPO pid=178, ip=10.0.51.87) Caught sync error: Sync process failed: The specified bucket does not exist. Retrying after sleeping for 1.0 seconds... [repeated 8x across cluster]
Regarding the instability of this test: I don't think this should be labeled as unstable anymore. The corresponding tune_cloud_durable_upload.aws test is not marked unstable.
By the way @can-anyscale, why doesn't this failure show up on the go/ossci dashboard?
Why did this only show up now?
Ohh, it's because it's unstable. This test has actually been failing for a long time -- probably since this PR: https://github.com/ray-project/ray/pull/34663
Removing the unstable mark after fixing the test sounds good!
Why doesn't this also happen in the other tune cloud tests? See tune_cloud_durable_upload.aws, which passed.

This is because these release tests use different base images. tune_cloud_durable_upload.aws uses the base ray image, while the failing tests use the ray-ml image. The ray-ml image is the one that pins the s3fs version, which causes this bucket not found error.
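A quick way to confirm the image difference (a sketch: only anyscale/ray-ml:nightly-py37-gpu is named in this thread; the base ray nightly tag below is an assumption):

```bash
# ray-ml nightly: expected to ship a pinned s3fs.
docker run --rm anyscale/ray-ml:nightly-py37-gpu pip show s3fs

# base ray nightly (tag assumed): expected to have no s3fs at all.
docker run --rm rayproject/ray:nightly-py37-gpu pip show s3fs || echo "s3fs not installed"
```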
The release tests passed on the initial PR: https://github.com/ray-project/ray/pull/34663
This is because:
- Release tests don't build new docker images. They just use the nightly image + build a new version of Ray on top. This new Ray wheel comes from the PR (if running release tests through the release tests PR pipeline).
- The PR added s3fs as a dependency in ml/requirements_tune.txt.
- The nightly image filled in by env["RAY_IMAGE_ML_NIGHTLY_GPU"] is "anyscale/ray-ml:nightly-py37-gpu". At that time, the nightly image did not have s3fs.
- The release test ran without s3fs, and you can see in the job logs that s3fs was not used because it's not installed:
2023-04-26 03:04:12,066 WARNING syncer.py:223 -- You are using S3 for remote storage, but you don't have `s3fs` installed. Due to a bug in PyArrow, this can lead to significant slowdowns. To avoid this, install s3fs with `pip install fsspec s3fs`.
- The test passed because it didn't go through the s3fs path and defaulted back to pyarrow for syncing with s3.
- The next day, the test probably started failing, due to s3fs being included in the nightly image and going through the new codepath.
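A sketch of how to double-check which sync path a given run will take, assuming shell access to the release-test environment:

```bash
# If this import fails, Tune logs the syncer.py warning above and falls back
# to pyarrow; if it succeeds, the s3fs codepath (and its version pin) is in play.
python -c "import s3fs, fsspec; print('s3fs', s3fs.__version__, 'fsspec', fsspec.__version__)" \
  || echo "s3fs not installed -> pyarrow fallback"
```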
Problems:
- Mismatched dependencies in the nightly docker image and the new dependencies that developers want to add. Any ideas on how to fix this @can-anyscale?
- Unstable release test causing this issue to be hidden for a long time on the ossci dashboard. (As oncall, I don't monitor buildkite at all; I'm just monitoring the preset dashboard.)
- TODO: Go through the list of our release tests and see which ones should actually be marked as unstable.
- TODO: Add a view on the preset dashboard that specifically keeps track of unstable tests. They shouldn't affect the pass % metrics, but they should still be visible somewhere.
Sidenote: see this issue on s3fs: https://github.com/fsspec/s3fs/issues/738
Problem with upgrading s3fs:
- Latest s3fs (2023.6.0) depends on aiobotocore 2.5, which depends on botocore<1.29.77,>=1.29.76.
- We pin boto3 elsewhere to 1.26.82, and that depends on botocore<1.30.0,>=1.29.82.
These two botocore requirements conflict (<1.29.77 vs >=1.29.82), so both pins can't be satisfied at the same time.
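A sketch of how to surface that conflict locally (assumes a pip new enough to support --dry-run, i.e. >= 22.2):

```bash
# Asking for both pins at once should make pip's resolver report a conflict:
# s3fs 2023.6.0 -> aiobotocore 2.5 needs botocore<1.29.77, while
# boto3 1.26.82 needs botocore>=1.29.82.
pip install --dry-run "s3fs==2023.6.0" "boto3==1.26.82"
```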
Test has been failing for far too long. Jailing.
Test passed on latest run: https://buildkite.com/ray-project/release-tests-branch/builds/1827#0188f67a-34ee-438c-be7b-9d62f4aa7647