[SPARK-40455][CORE] Abort result stage directly when it fails due to FetchFailedException
What changes were proposed in this pull request?
Abort the result stage directly when it fails due to FetchFailedException.
Why are the changes needed?
This fixes a serious bug: a result stage with an indeterminate parent map stage can be resubmitted, leading to data inconsistency.
The inconsistency arises as follows: when a result stage fails due to FetchFailedException, Spark decides whether the stage can be retried. The original condition is numMissingPartitions < resultStage.numTasks, which is not precise.
If this condition holds and the stage is retried, some tasks from the failed attempt may still be running and not yet killed; when the result stage is resubmitted, it computes the wrong set of partitions to recalculate.
// DAGScheduler#submitMissingTasks
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
The number of partitions selected for recomputation can therefore be smaller than the number that actually needs to be recomputed at the result stage, and data inconsistency may occur.
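To make the race concrete, here is a simplified, hypothetical sketch of the retry-eligibility check described above. The names (ResultStageState, canRetry) are illustrative only and are not Spark's actual internals; the point is that the check compares counts at resubmission time, while still-running tasks can change those counts afterwards.

```scala
// Hypothetical model of a result stage's completion state.
// Not Spark's real DAGScheduler types; a sketch of the described condition.
case class ResultStageState(numTasks: Int, finishedPartitions: Set[Int]) {
  // Partitions whose output is still missing at this moment.
  def numMissingPartitions: Int = numTasks - finishedPartitions.size
}

// The original condition: retry is allowed whenever at least one
// partition has already finished. If other tasks of the failed attempt
// are still running when the stage is resubmitted, the "finished" set
// can grow afterwards, so the missing-partition snapshot taken at
// resubmission time may under-count what truly needs recomputation
// under an indeterminate parent.
def canRetry(stage: ResultStageState): Boolean =
  stage.numMissingPartitions < stage.numTasks

val stage = ResultStageState(numTasks = 4, finishedPartitions = Set(0))
println(canRetry(stage)) // true: 3 missing partitions < 4 tasks
```

With an indeterminate parent, the partitions that "finished" before the failure may have consumed stale shuffle output, so retrying only the snapshot of missing partitions is unsafe; aborting the stage avoids mixing old and new parent output.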
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and a new test.
Gently ping @cloud-fan. Could you help verify this patch?
Can one of the admins verify this patch?
Can you update the description with the behavior we are actually observing? The details in the jira and PR description do not explain what the issue is, just the proposal for a fix.
+CC @Ngone51. I will take a look at this PR hopefully early next week.
@mridulm Hi, I have updated the description. Could you verify the patch again?
gently ping @Ngone51
If a result stage does not have pending partitions, it does not need to be aborted - since there are no partitions to be computed.
If a result stage has pending partitions and an indeterminate parent fails, it would have been aborted the first time it failed - so the assumption "If this condition holds on retry" from the description does not apply; the stage would already have been aborted on the first failure.
Please let me know if there are queries. +CC @Ngone51
Thanks for the ping. I agree with @mridulm. The original condition doesn't seem to give the result stage a chance to retry. Is anything missing?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!