
[SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Open warrenzhu25 opened this issue 2 years ago • 4 comments

What changes were proposed in this pull request?

Add a config, spark.stage.ignoreDecommissionFetchFailure, that controls whether stage fetch failures caused by a decommissioned executor are ignored when counting toward spark.stage.maxConsecutiveAttempts.
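As a rough illustration of how the new setting might be combined with the existing ones, the sketch below renders a small set of Spark properties as `spark-submit --conf` flags. The helper name `to_submit_flags` is hypothetical, and the values shown are only examples, not recommendations from this PR.

```python
# Hypothetical helper: turns a dict of Spark properties into
# "--conf key=value" arguments suitable for spark-submit.
def to_submit_flags(conf):
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example values only; tune them for your own cluster.
conf = {
    # Executor decommissioning must be on for the new flag to matter.
    "spark.decommission.enabled": "true",
    # The config proposed in this PR.
    "spark.stage.ignoreDecommissionFetchFailure": "true",
    # Existing limit on consecutive stage attempts before the stage aborts.
    "spark.stage.maxConsecutiveAttempts": "4",
}

flags = to_submit_flags(conf)
print("spark-submit " + " ".join(flags))
```

The same keys could also be placed in spark-defaults.conf; the flag form is used here only to keep the example self-contained.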

An executor is considered decommissioned in the following cases:

  1. Waiting for decommission to start
  2. Undergoing the decommission process
  3. Stopped or terminated after finishing decommission
  4. Undergoing the decommission process, then removed by the driver for other reasons

In case 4, the fetch failure might not be caused by executor decommission, so this is a best-effort approach based on the current mechanism.
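The four cases above, and how they affect the attempt count, can be sketched with a simplified model. This is not the DAGScheduler code from the patch: the `ExecState` enum, `counts_toward_max_attempts`, and `should_abort_stage` are hypothetical names invented for illustration, and the model assumes each failed stage attempt is attributed to a single executor state.

```python
from enum import Enum, auto


class ExecState(Enum):
    ACTIVE = auto()
    DECOMMISSION_PENDING = auto()           # case 1: waiting for decommission to start
    DECOMMISSIONING = auto()                # case 2: undergoing decommission
    DECOMMISSIONED = auto()                 # case 3: stopped after finishing decommission
    REMOVED_WHILE_DECOMMISSIONING = auto()  # case 4: removed by the driver mid-decommission


# All four cases are treated as "decommissioned" (best effort for case 4).
DECOMMISSION_STATES = {
    ExecState.DECOMMISSION_PENDING,
    ExecState.DECOMMISSIONING,
    ExecState.DECOMMISSIONED,
    ExecState.REMOVED_WHILE_DECOMMISSIONING,
}


def counts_toward_max_attempts(state, ignore_decommission_fetch_failure):
    """Return True if a fetch failure from this executor should be counted."""
    if ignore_decommission_fetch_failure and state in DECOMMISSION_STATES:
        return False
    return True


def should_abort_stage(failure_states, max_consecutive_attempts, ignore):
    """Abort once the counted fetch failures reach the configured limit."""
    counted = sum(
        1 for s in failure_states if counts_toward_max_attempts(s, ignore)
    )
    return counted >= max_consecutive_attempts


# Four failed attempts, three of them caused by decommissioned executors.
failures = [
    ExecState.DECOMMISSIONING,
    ExecState.DECOMMISSIONED,
    ExecState.ACTIVE,
    ExecState.REMOVED_WHILE_DECOMMISSIONING,
]
print(should_abort_stage(failures, 4, ignore=False))
print(should_abort_stage(failures, 4, ignore=True))
```

With the flag off, all four failures count and the stage reaches the limit; with the flag on, only the failure from the active executor counts, so the stage is allowed to retry.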

Why are the changes needed?

When executor decommission is enabled, more stages fail because of FetchFailed errors from decommissioned executors, which can in turn fail the whole job. One reason is that a decommissioning executor does not wait for all FetchData requests to finish; it exits on its own once it has no running tasks and migration has finished. It would be better not to count such failures toward spark.stage.maxConsecutiveAttempts. AWS EMR already supports this; please refer to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added test in DAGSchedulerSuite

warrenzhu25 avatar Sep 18 '22 22:09 warrenzhu25

The following claim seems too general to be true. Could you be more specific, @warrenzhu25?

When executor decommission is enabled, there would be many stage failure caused by FetchFailed from decommissioned executor,

cc @holdenk too

Updated description.

warrenzhu25 avatar Sep 19 '22 00:09 warrenzhu25

Thank you for the swift updates, @warrenzhu25 .

dongjoon-hyun avatar Sep 19 '22 00:09 dongjoon-hyun

I made a few comments. BTW, thank you for working on this area, @warrenzhu25 .

dongjoon-hyun avatar Sep 19 '22 01:09 dongjoon-hyun

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 19 '22 21:09 AmplabJenkins

@dongjoon-hyun Any more comments?

warrenzhu25 avatar Sep 27 '22 18:09 warrenzhu25

Sorry, I was too busy with the 3.3.1 preparation. I will resume the review of this new improvement.

dongjoon-hyun avatar Sep 27 '22 18:09 dongjoon-hyun