
[SPARK-45579][CORE] Catch errors for FallbackStorage.copy

Open ukby1234 opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

As documented in the JIRA ticket, FallbackStorage.copy sometimes throws a FileNotFoundException even though we first check that the file exists. This causes the BlockManagerDecommissioner to get stuck in an endless loop and prevents executors from exiting. We should ignore any FileNotFoundException in this case, and set keepRunning to false for all other exceptions for retries.
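The handling policy described above can be sketched in isolation. This is an illustrative Java sketch, not the actual Spark change: `CopyErrorPolicy`, `CopyTask`, and `migrateAll` are hypothetical names, and each `CopyTask` merely stands in for one FallbackStorage.copy call. It assumes a FileNotFoundException means the block was already deleted and can be skipped, while any other exception clears a `keepRunning` flag.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

class CopyErrorPolicy {
    /** Hypothetical stand-in for a single FallbackStorage.copy call. */
    interface CopyTask {
        void run() throws IOException;
    }

    /**
     * Runs the tasks in order while keepRunning is true.
     * A FileNotFoundException is ignored (assumption: the source was deleted
     * after the existence check, so there is nothing left to copy).
     * Any other exception sets keepRunning to false, which ends the loop.
     * Returns the final value of keepRunning.
     */
    static boolean migrateAll(List<CopyTask> tasks) {
        boolean keepRunning = true;
        for (CopyTask task : tasks) {
            if (!keepRunning) {
                break;
            }
            try {
                task.run();
            } catch (FileNotFoundException e) {
                // Source vanished between the check and the copy: skip it.
            } catch (Exception e) {
                // Unexpected failure: stop this pass instead of spinning forever.
                keepRunning = false;
            }
        }
        return keepRunning;
    }
}
```

In this sketch a missing file never blocks the remaining tasks, while an unexpected failure ends the pass; how the real decommissioner resumes afterwards is outside what the sketch models.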

Why are the changes needed?

Fixes the bug documented in the JIRA ticket.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests weren't added because the race condition is difficult to reproduce.

Was this patch authored or co-authored using generative AI tooling?

No

ukby1234 avatar Oct 17 '23 21:10 ukby1234

Thank you for making a PR, @ukby1234 .

dongjoon-hyun avatar Oct 17 '23 21:10 dongjoon-hyun

Do you think we can have a test coverage here?

https://github.com/apache/spark/blob/f1ae56b152bdf19246d698b65e553790ad54306b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala#L43

Added unit test coverage.

ukby1234 avatar Oct 18 '23 02:10 ukby1234

Hmm, looks like the SQL test just timed out, and I've already retried a couple of times. cc @dongjoon-hyun

ukby1234 avatar Oct 18 '23 23:10 ukby1234

  1. Does this happen with any fs client other than the s3a one?
  2. Does anyone know why it happens?
  3. There's a PR up to turn off use of the AWS SDK for its uploads, which will switch back to the classic sequential block read/upload algorithm that everything else uses. Reviews encouraged: https://github.com/apache/hadoop/pull/6163

steveloughran avatar Oct 23 '23 13:10 steveloughran

I think I can answer 2). It seems shuffle blocks are deleted between the fs.exists and fs.copyFromLocal calls. From the stack trace linked in the JIRA ticket, the failure occurs inside org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource.
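To make the window concrete, here is a small Java sketch of the same check-then-act pattern using only the local filesystem. It is not the Hadoop or Spark code path: `CheckThenCopyRace`, `copyIfPresent`, and the `betweenCheckAndCopy` hook are hypothetical, with `Files.exists` standing in for fs.exists and a stream copy standing in for fs.copyFromLocal. The hook simulates another thread deleting the file inside the window.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CheckThenCopyRace {
    /**
     * Copies src to dst if src exists.
     * betweenCheckAndCopy simulates whatever a concurrent thread does between
     * the existence check and the copy (for example, deleting src).
     * Returns true if the copy happened, false if src was missing at the check
     * or vanished before it could be opened.
     */
    static boolean copyIfPresent(Path src, Path dst, Runnable betweenCheckAndCopy)
            throws IOException {
        if (!Files.exists(src)) {
            return false; // the check the caller relies on
        }
        betweenCheckAndCopy.run(); // the race window
        try (InputStream in = new FileInputStream(src.toFile())) {
            Files.copy(in, dst, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (FileNotFoundException e) {
            // The check passed, but the file is gone now: treat as nothing to copy.
            return false;
        }
    }
}
```

Passing a hook that deletes `src` takes the FileNotFoundException branch even though the existence check succeeded, which is why the check alone cannot rule out the exception.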

ukby1234 avatar Oct 23 '23 16:10 ukby1234

@ukby1234 thanks

steveloughran avatar Oct 25 '23 12:10 steveloughran

@dongjoon-hyun friendly bump

ukby1234 avatar Jan 22 '24 18:01 ukby1234

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar May 20 '24 00:05 github-actions[bot]