[SPARK-45579][CORE] Catch errors for FallbackStorage.copy
What changes were proposed in this pull request?
As documented in the JIRA ticket, FallbackStorage.copy sometimes throws FileNotFoundException even though we check beforehand that the file exists. This causes the BlockManagerDecommissioner to get stuck in an endless loop and prevents executors from exiting. We should ignore any FileNotFoundException in this case, and set keepRunning to false for all other exceptions so they can be retried.
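The intended handling can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual BlockManagerDecommissioner code: `migrateAll`, `Outcome`, and the string block names are hypothetical, and the `copy` argument stands in for FallbackStorage.copy.

```scala
import java.io.{FileNotFoundException, IOException}
import scala.util.control.NonFatal

// Hypothetical result type for the sketch; not part of Spark.
final case class Outcome(copied: Vector[String], skipped: Vector[String], keepRunning: Boolean)

// Hypothetical migration loop: `copy` stands in for FallbackStorage.copy.
def migrateAll(blocks: Seq[String])(copy: String => Unit): Outcome = {
  var keepRunning = true
  var copied = Vector.empty[String]
  var skipped = Vector.empty[String]
  val it = blocks.iterator
  while (keepRunning && it.hasNext) {
    val block = it.next()
    try {
      copy(block)
      copied :+= block
    } catch {
      case _: FileNotFoundException =>
        // The file disappeared after the existence check: there is nothing
        // left to migrate, so skip this block instead of looping on it.
        skipped :+= block
      case NonFatal(_) =>
        // Any other failure stops this loop so the caller can retry later.
        keepRunning = false
    }
  }
  Outcome(copied, skipped, keepRunning)
}

// Example run: "gone" simulates a deleted shuffle file, "bad" an unrelated I/O error.
val outcome = migrateAll(Seq("a", "gone", "b", "bad", "c")) {
  case "gone" => throw new FileNotFoundException("gone")
  case "bad"  => throw new IOException("boom")
  case _      => ()
}
```

In this run the missing block is skipped and the loop continues, while the unrelated I/O error clears `keepRunning` and the remaining block is left untouched.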
Why are the changes needed?
This fixes the bug documented in the JIRA ticket (SPARK-45579).
Does this PR introduce any user-facing change?
No
How was this patch tested?
No tests were added because the race condition is difficult to reproduce.
Was this patch authored or co-authored using generative AI tooling?
No
Thank you for making a PR, @ukby1234 .
Do you think we can have test coverage here?
https://github.com/apache/spark/blob/f1ae56b152bdf19246d698b65e553790ad54306b/core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala#L43
Added unit test coverage.
Hmm, it looks like the SQL test just timed out, and I have already retried a couple of times. cc @dongjoon-hyun
1. Does this happen with any fs client other than the s3a one?
2. Does anyone know why it happens?
3. There's a PR up to turn off use of the AWS SDK for its uploads, which will switch back to the classic sequential block read/upload algorithm used by everything else. Reviews encouraged: https://github.com/apache/hadoop/pull/6163
I think I can answer 2). It seems shuffle blocks are deleted in between the fs.exists and fs.copyFromLocal calls. From the stack trace linked in the JIRA ticket, it fails inside org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource.
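For illustration, the same check-then-copy race can be reproduced on a local filesystem. This is only a sketch under stated assumptions: it uses plain java.io and java.nio rather than the Hadoop FileSystem or S3A client, and `copyIfPresent` and its `afterCheck` hook (which stands in for a concurrent deleter) are hypothetical names.

```scala
import java.io.{File, FileInputStream, FileNotFoundException}
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: check that the source exists, then copy it.
// `afterCheck` runs between the two steps to simulate another thread
// deleting the file inside the race window.
def copyIfPresent(src: File, dst: Path, afterCheck: () => Unit = () => ()): Boolean = {
  if (!src.exists()) {
    false
  } else {
    afterCheck()
    try {
      // FileInputStream throws FileNotFoundException if src vanished after the check.
      val in = new FileInputStream(src)
      try Files.copy(in, dst, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      true
    } catch {
      case _: FileNotFoundException => false // treat as "nothing to copy"
    }
  }
}

val dir = Files.createTempDirectory("race-demo")
val src = Files.write(dir.resolve("shuffle_0.data"), "payload".getBytes("UTF-8")).toFile
val dst = dir.resolve("copy.data")

// No interference: the existence check and the copy both succeed.
val normal = copyIfPresent(src, dst)

// Simulated race: the file is deleted after the check but before the copy.
val racedDst = dir.resolve("copy2.data")
val raced = copyIfPresent(src, racedDst, () => { src.delete(); () })
```

The existence check passes in both calls, so the check alone cannot rule out the exception; only the `catch` makes the second call return cleanly.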
@ukby1234 thanks
@dongjoon-hyun friendly bump
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!