[SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor
What changes were proposed in this pull request?
Add a config spark.stage.ignoreDecommissionFetchFailure
to control whether to ignore stage fetch failures caused by decommissioned executors when counting spark.stage.maxConsecutiveAttempts (see the usage sketch after the list below).
An executor is considered decommissioned in the following cases:
1. Waiting for decommission to start
2. In the middle of the decommission process
3. Stopped or terminated after finishing decommission
4. In the middle of the decommission process, then removed by the driver for other reasons
In case 4, the fetch failure might not actually be caused by executor decommission, but this is a best-effort approach given the current mechanism.
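A minimal usage sketch (my own illustration, not taken from the PR), assuming the flag lands under this name and that the related decommission and stage-retry configs keep their current names:

```scala
import org.apache.spark.SparkConf

// Illustrative only: spark.stage.ignoreDecommissionFetchFailure is the config
// proposed in this PR; the other settings are existing Spark configs whose
// exact behavior may vary by version.
val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")                  // enable executor decommissioning
  .set("spark.stage.maxConsecutiveAttempts", "4")             // existing stage retry budget
  .set("spark.stage.ignoreDecommissionFetchFailure", "true")  // do not count decommission-related fetch failures
```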
Why are the changes needed?
When executor decommission is enabled, there are more stage failures caused by FetchFailed from decommissioned executors, which can in turn fail the whole job. One reason is that a decommissioning executor does not wait for all FetchData requests to finish; it exits on its own once it has no running tasks and block migration is complete. It would be better not to count such failures toward spark.stage.maxConsecutiveAttempts. AWS EMR already supports this; see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
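Conceptually, the intended counting rule can be summarized as follows (a hypothetical sketch of my own; the real logic lives in the DAGScheduler and is more involved):

```scala
// Hypothetical helper, not actual Spark code: a fetch failure counts toward
// spark.stage.maxConsecutiveAttempts only when the new flag is off or the
// executor serving the shuffle data was not decommissioned.
def countsTowardMaxConsecutiveAttempts(
    ignoreDecommissionFetchFailure: Boolean,
    executorWasDecommissioned: Boolean): Boolean = {
  !(ignoreDecommissionFetchFailure && executorWasDecommissioned)
}
```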
Does this PR introduce any user-facing change?
Yes, a new configuration spark.stage.ignoreDecommissionFetchFailure is added.
How was this patch tested?
Added a test in DAGSchedulerSuite.
The following claim seems to be too general to be true. Could you be more specific, @warrenzhu25?
When executor decommission is enabled, there would be many stage failure caused by FetchFailed from decommissioned executor,
cc @holdenk too
Updated description.
Thank you for the swift updates, @warrenzhu25 .
I made a few comments. BTW, thank you for working on this area, @warrenzhu25 .
Can one of the admins verify this patch?
@dongjoon-hyun Any more comments?
Sorry, I was too busy with the 3.3.1 release preparation. I will resume the review of this new improvement.