[air] Add long running cloud storage checkpoint test

Open · krfricke opened this pull request 1 year ago • 13 comments

Why are these changes needed?

This is a test to make sure that cloud storage access still works after multi-day training. See e.g. #35519

This PR also adds a minuscule concurrency group to release test runners for long-running tests to avoid congestion during release cycles.

Related issue number

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

krfricke avatar Jun 06 '23 09:06 krfricke

(leaving this for @can-anyscale to review)

aslonnie avatar Jun 06 '23 18:06 aslonnie

Do we know what was the root cause for https://github.com/ray-project/ray/issues/35519? Having a test run for 2.5 days worries me a bit. When it fails, it will take us a minimum of a week to resolve.

Maybe it is ok to have a long-running test, as long as there are other cheap tests that serve as the main means of finding regressions.

can-anyscale avatar Jun 06 '23 19:06 can-anyscale

Do we know what was the root cause for #35519? Having a test run for 2.5 days worries me a bit. When it fails, it will take us a minimum of a week to resolve.

Maybe it is ok to have a long-running test, as long as there are other cheap tests that serve as the main means of finding regressions.

The root cause was likely that we cached filesystem objects instead of re-instantiating them, which caused stale credentials to stick around for too long.

If there's a way to have the credentials expire and refresh earlier on e.g. AWS, that would be preferable.

In any case, some users are running workloads that take longer than 2 days. Not having this kind of test at all is also a gap. IMO, detecting bugs with low granularity is still better than not detecting the bug at all.

krfricke avatar Jun 07 '23 09:06 krfricke

some users also run Ray for more than 30 days, should we also have a 30-day Ray release test?

aslonnie avatar Jun 07 '23 12:06 aslonnie

Is there a good way to auto-collect crashes or usage patterns from Ray users? How many users (X) run workloads for more than 2 days? How often do they fail? This test is costly and will often delay the release process, which itself delivers fixes to some number of users (Y). I guess what I'm trying to say is: between protecting X users and delaying fixes to Y users, this test makes more sense if X is a big enough number and potentially bigger than Y.

If X is relatively small, then maybe it is ok to sometimes release with bugs for their use cases, as long as (and this is important) we can optimize for releasing fixes quickly, like within a week.

I guess what I'm trying to say is: our release process should take as long as it needs to protect the main 90% of use cases; for the remaining 10% of long-tail use cases, it should optimize for speed instead of protection. Does that make sense?

can-anyscale avatar Jun 07 '23 15:06 can-anyscale

some users also run Ray for more than 30 days, should we also have a 30-day Ray release test?

I get your point. I think if we expect some state to change after some time, we should test that. Ideally, we can trigger this state change earlier. So in the case of this test, if we can set the credentials to expire every 15 minutes rather than e.g. every 24 hours, we would only need to run it for 30 minutes.

Can we do that? cc @ijrsvt

To answer your question, I think it's unfeasible as a release test, but I do think we should run very long-running workloads ourselves for dogfooding.

Generally there's a trade-off between time-to-fix and test duration. We have a number of long-running jobs that run for 24 hours, and they've helped us identify a lot of issues in the past (e.g. memory leaks, segfaults, application logic errors, performance regressions).

So it's not that they don't have any utility. The question is where the cut-off lies and when we get diminishing returns, which is what @can-anyscale is raising.

krfricke avatar Jun 08 '23 08:06 krfricke

Is there a good way to auto-collect crashes or usage patterns from Ray users? How many users (X) run workloads for more than 2 days? How often do they fail? This test is costly and will often delay the release process, which itself delivers fixes to some number of users (Y). I guess what I'm trying to say is: between protecting X users and delaying fixes to Y users, this test makes more sense if X is a big enough number and potentially bigger than Y.

If X is relatively small, then maybe it is ok to sometimes release with bugs for their use cases, as long as (and this is important) we can optimize for releasing fixes quickly, like within a week.

I guess what I'm trying to say is: our release process should take as long as it needs to protect the main 90% of use cases; for the remaining 10% of long-tail use cases, it should optimize for speed instead of protection. Does that make sense?

At this time, I mainly needed to be able to run a test for 2.5 days to check whether our fix for https://github.com/ray-project/ray/issues/35519 actually resolves the issue. The test will cost about $15 to run on AWS; for a weekly test, I think this is an acceptable cost.

Again, I'd ideally like to test this on a shorter timescale. Assuming that doesn't work easily: if you have concerns about this test slowing down releases, can we run it weekly and mark it as unstable, i.e. make it non-release-blocking?

krfricke avatar Jun 08 '23 09:06 krfricke

Can we do that? cc @ijrsvt

@ijrsvt is OOO, but you certainly can assume an AWS role and get a credential that expires in 15 minutes.

https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html

If it is not restricted to AWS credentials, some other credentials can have an even shorter lifespan.
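
For illustration, here is a minimal sketch of that AssumeRole approach using boto3; the role ARN and session name are placeholders, and how the resulting short-lived credentials get handed to the training job is left to the test harness:

import boto3

# Assume a role for the shortest session STS allows (900 seconds = 15 minutes).
# The role ARN below is a placeholder, not a real resource from this PR.
sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/short-lived-checkpoint-test",
    RoleSessionName="cloud-checkpoint-test",
    DurationSeconds=900,
)
credentials = response["Credentials"]
# AccessKeyId, SecretAccessKey, SessionToken and Expiration can then be exported
# as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN for the job.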

aslonnie avatar Jun 08 '23 12:06 aslonnie

e.g. every 24 hours, we would only need to run it for 30 minutes.

AWS Assume Role Credentials cannot last for more than 12 hours. #35519 was using VM-based credentials which last for 6 hours.

I'm wondering if the issue is in pyarrow/s3fs. I would expect the following code to work on an EC2 instance:

import pyarrow.fs

# FileSystem.from_uri returns a (filesystem, path) tuple.
fs, path = pyarrow.fs.FileSystem.from_uri("s3://my_bucket")
while True:
    fs.write_data()  # Not a real function
# Loop runs forever

ijrsvt avatar Jun 08 '23 14:06 ijrsvt

@krfricke oh yes definitely marking it as non-release-blocking will resolve my concern, thanks Kai

can-anyscale avatar Jun 08 '23 15:06 can-anyscale

e.g. every 24 hours, we would only need to run it for 30 minutes.

AWS Assume Role Credentials cannot last for more than 12 hours. #35519 was using VM-based credentials which last for 6 hours.

Can we selectively have credentials expire earlier on clusters? If not, does this still mean that the test only needs to run for 12 hours and a few minutes?

I'm wondering if the issue is in pyarrow/s3fs. I would expect the following code to work on an EC2 instance:

import pyarrow.fs

# FileSystem.from_uri returns a (filesystem, path) tuple.
fs, path = pyarrow.fs.FileSystem.from_uri("s3://my_bucket")
while True:
    fs.write_data()  # Not a real function
# Loop runs forever

I think this is a combination of s3fs (fsspec) wrapped by pyarrow. There seems to be an issue describing this: https://github.com/fsspec/s3fs/issues/632

Expiring the cached FS after 5 minutes thus should have resolved the issue.
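
For illustration, a minimal sketch of that kind of time-based expiry; _CACHE_TTL_S and _get_filesystem are illustrative names, not the actual Ray internals:

import time

import pyarrow.fs

_CACHE_TTL_S = 300  # re-create the filesystem every 5 minutes
_cached_fs = None
_cached_at = 0.0

def _get_filesystem(uri: str) -> pyarrow.fs.FileSystem:
    # Assumes a single URI; a real implementation would key the cache by URI.
    global _cached_fs, _cached_at
    now = time.monotonic()
    if _cached_fs is None or now - _cached_at > _CACHE_TTL_S:
        # Re-instantiating picks up fresh credentials from the environment / VM role.
        _cached_fs, _ = pyarrow.fs.FileSystem.from_uri(uri)
        _cached_at = now
    return _cached_fs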

@krfricke oh yes definitely marking it as non-release-blocking will resolve my concern, thanks Kai

Let's see if we can just cut the test length down to 13 hours or even less (looks like this may be possible), in which case I think we can add it as a 3x/week test.

krfricke avatar Jun 09 '23 09:06 krfricke

@krfricke, after looking at the original issue again, I'm more suspicious that this has nothing to do with credentials & is instead some other error. The error message [1] seems to just catch a general error & print a statement indicating that credentials are not working. Is it possible that after 2 days the checkpoint became too big and caused this error?

If credential expiration was the root cause of https://github.com/ray-project/ray/issues/35519, it would occur at the 6-hour mark, not the 2+ day mark. It could also imply that there is a non-deterministic credential refresh failure, but that seems unlikely, as the s3fs source pretty directly uses AWS's Botocore library [2], which is designed to handle refreshing VM credentials correctly (we use this feature extensively).

The S3fs issue linked (https://github.com/fsspec/s3fs/issues/632) is different as it refers to hard-coded temporary credentials, not VM credentials. Botocore does not natively support refreshing these.

Can we selectively have credentials expire earlier on clusters? If not, does this still mean that the test only needs to run for 12 hours and a few minutes?

VM credentials expire after 6 hours and I am unaware of a way to change this time.

[1] https://github.com/ray-project/ray/blob/d2ab1297769767ca0d702172cc41de5dac8b9bb3/python/ray/tune/trainable/trainable.py#L671

[2] https://github.com/fsspec/s3fs/blob/10c67473a809ad254647e4ba4d9fe30b4f67ca89/s3fs/core.py#LL502C33-L502C33
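
As a side note, to check when the instance-profile credentials on a node actually expire (the 6-hour window mentioned above), here is a small sketch that queries the EC2 instance metadata service (IMDSv2); it only works from inside an EC2 instance and is not part of this PR:

import json
import urllib.request

IMDS = "http://169.254.169.254"

# IMDSv2 requires fetching a session token first.
token_request = urllib.request.Request(
    f"{IMDS}/latest/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
)
token = urllib.request.urlopen(token_request).read().decode()

def imds_get(path: str) -> str:
    request = urllib.request.Request(
        f"{IMDS}{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    return urllib.request.urlopen(request).read().decode()

role_name = imds_get("/latest/meta-data/iam/security-credentials/").strip()
creds = json.loads(imds_get(f"/latest/meta-data/iam/security-credentials/{role_name}"))
print("Instance credentials expire at:", creds["Expiration"])  # ISO-8601 timestamp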

ijrsvt avatar Jun 12 '23 16:06 ijrsvt

@krfricke, after looking at the original issue again, I'm more suspicious that this has nothing to do with credentials & is instead some other error. The error message [1] seems to just catch a general error & print a statement indicating that credentials are not working. Is it possible that after 2 days the checkpoint became too big and caused this error?

If credential expiration was the root cause of #35519, it would occur at the 6-hour mark, not the 2+ day mark. It could also imply that there is a non-deterministic credential refresh failure, but that seems unlikely, as the s3fs source pretty directly uses AWS's Botocore library [2], which is designed to handle refreshing VM credentials correctly (we use this feature extensively).

Unfortunately I don't have more background on the issue. But I agree, it looks like the issue may be different. I'll try to follow up to see if there may have been other factors.

The S3fs issue linked (fsspec/s3fs#632) is different as it refers to hard-coded temporary credentials, not VM credentials. Botocore does not natively support refreshing these.

Can we selectively have credentials expire earlier on clusters? If not, does this still mean that the test only needs to run for 12 hours and a few minutes?

VM credentials expire after 6 hours and I am unaware of a way to change this time.

[1] https://github.com/ray-project/ray/blob/d2ab1297769767ca0d702172cc41de5dac8b9bb3/python/ray/tune/trainable/trainable.py#L671

[2] https://github.com/fsspec/s3fs/blob/10c67473a809ad254647e4ba4d9fe30b4f67ca89/s3fs/core.py#LL502C33-L502C33

The test probably still has merit, though I understand that the root cause may have been different and it may have worked correctly before. I'll update the error message shown when syncing fails so that it reflects the cause better.

krfricke avatar Jun 14 '23 14:06 krfricke