spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException

Open caican00 opened this issue 2 years ago • 5 comments

What changes were proposed in this pull request?

Abort result stage directly when it failed caused by FetchFailedException.

Why are the changes needed?

Here's a very serious bug: The resultStage with indeterminate parent mapStage resubmit and it led to data inconsistency problems.

And The reasons for data inconsistency are as follows: When result stage failed caused by FetchFailedException,  spark will determine whether it can be retried. And the original condition is numMissingPartitions < resultStage.numTasks. It is not an exact condition.

If this condition holds on retry, at this time some other running tasks at the current failed result stage might not have been killed yet, when result stage was resubmit, it would got wrong partitions to recalculation.

// DAGScheduler#submitMissingTasks
 
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() 

It is possible that the number of partitions to be recalculated is smaller than the actual number of partitions at result stage and data inconsistency might occur.

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests and new test

caican00 avatar Sep 15 '22 12:09 caican00

gently ping @cloud-fan Could you help to verify this patch?

caican00 avatar Sep 15 '22 12:09 caican00

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 16 '22 19:09 AmplabJenkins

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix.

+CC @Ngone51. I will take a look at this PR hopefully early next week.

mridulm avatar Sep 16 '22 22:09 mridulm

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix.

+CC @Ngone51. I will take a look at this PR hopefully early next week.

@mridulm Hi, i have updated the description. Could you verify the patch again?

caican00 avatar Sep 20 '22 06:09 caican00

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix. +CC @Ngone51. I will take a look at this PR hopefully early next week.

@mridulm Hi, i have updated the description. Could you verify the patch again?

gently ping @Ngone51

caican00 avatar Sep 20 '22 06:09 caican00

If a result stage does not have pending partitions, it does not need to be aborted - since there are no partitions to be computed.

If a result stage has pending partitions with an indeterminate parent failing, it would have been aborted the first time it failed - so the assumption that If this condition holds on retry, from description does not apply - the first failure would have been aborted already.

Please let me know if there are queries. +CC @Ngone51

mridulm avatar Sep 23 '22 01:09 mridulm

Thanks for the ping. I agree with @mridulm . The original condition doesn't seem to have a chance for the result stage to retry. Is there anything missed?

Ngone51 avatar Sep 26 '22 14:09 Ngone51

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jan 05 '23 00:01 github-actions[bot]