[SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor
What changes were proposed in this pull request?
Add a config spark.stage.ignoreDecommissionFetchFailure
to control whether to ignore stage fetch failures caused by decommissioned executors when counting spark.stage.maxConsecutiveAttempts (see the usage sketch after the list below).
An executor is considered decommissioned in the following cases:
1. Waiting for decommission to start
2. In the middle of the decommission process
3. Stopped or terminated after finishing decommission
4. In the middle of the decommission process, then removed by the driver for other reasons
In case 4, the fetch failure might not actually be caused by executor decommission, but this is a best-effort approach given the current mechanism.
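A minimal usage sketch (my own illustration, not taken from the PR), assuming the flag lands under this name and that the related decommission and stage-retry configs keep their current names:

```scala
import org.apache.spark.SparkConf

// Illustrative only: spark.stage.ignoreDecommissionFetchFailure is the config
// proposed in this PR; the other settings are existing Spark configs whose
// exact behavior may vary by version.
val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")                  // enable executor decommissioning
  .set("spark.stage.maxConsecutiveAttempts", "4")             // existing stage retry budget
  .set("spark.stage.ignoreDecommissionFetchFailure", "true")  // do not count decommission-related fetch failures
```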
Why are the changes needed?
When executor decommission is enabled, there are more stage failures caused by FetchFailed from decommissioned executors, which can in turn fail the whole job. One reason is that a decommissioning executor does not wait for all FetchData requests to finish; it exits on its own once it has no running tasks and block migration is complete. It would be better not to count such failures toward spark.stage.maxConsecutiveAttempts. AWS EMR already supports this; see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
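Conceptually, the intended counting rule can be summarized as follows (a hypothetical sketch of my own; the real logic lives in the DAGScheduler and is more involved):

```scala
// Hypothetical helper, not actual Spark code: a fetch failure counts toward
// spark.stage.maxConsecutiveAttempts only when the new flag is off or the
// executor serving the shuffle data was not decommissioned.
def countsTowardMaxConsecutiveAttempts(
    ignoreDecommissionFetchFailure: Boolean,
    executorWasDecommissioned: Boolean): Boolean = {
  !(ignoreDecommissionFetchFailure && executorWasDecommissioned)
}
```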
Does this PR introduce any user-facing change?
Yes, a new configuration spark.stage.ignoreDecommissionFetchFailure is added.
How was this patch tested?
Added a test in DAGSchedulerSuite.
The following claim seems to be too general to be true. Could you be more specific, @warrenzhu25?
When executor decommission is enabled, there would be many stage failure caused by FetchFailed from decommissioned executor,
cc @holdenk too
Updated description.
Thank you for the swift updates, @warrenzhu25 .
I made a few comments. BTW, thank you for working on this area, @warrenzhu25 .
Can one of the admins verify this patch?
@dongjoon-hyun Any more comments?
Sorry, I was too busy with the 3.3.1 release preparation. I will resume the review of this new improvement.