[SPARK-54556][CORE] Rollback succeeding shuffle map stages when shuffle checksum mismatch detected
What changes were proposed in this pull request?
Roll back the succeeding shuffle map stages when a shuffle checksum mismatch is detected:
- cancel and resubmit the stage if it is running;
- clean up the shuffle status so that the stage is guaranteed to be resubmitted;
- record the rollback attemptId and ignore results from older attempts, which may have consumed inconsistent data.
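The three steps above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code in this patch: `Stage`, `ToyScheduler`, `min_valid_attempt`, and the other names are hypothetical stand-ins rather than Spark classes or fields, and "succeeding" is taken to mean every stage that reads, directly or transitively, from the stage whose checksum mismatched.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy stand-in for a shuffle map stage (hypothetical, not Spark's Stage)."""
    stage_id: int
    parents: list = field(default_factory=list)   # ids of upstream stages
    running: bool = False
    latest_attempt: int = 0
    outputs: dict = field(default_factory=dict)   # partition -> attempt that wrote it
    min_valid_attempt: int = 0                    # results from older attempts are ignored


class ToyScheduler:
    def __init__(self, stages):
        self.stages = {s.stage_id: s for s in stages}

    def succeeding(self, stage_id):
        """Ids of all stages that depend on stage_id, directly or transitively."""
        found, frontier = set(), [stage_id]
        while frontier:
            current = frontier.pop()
            for s in self.stages.values():
                if current in s.parents and s.stage_id not in found:
                    found.add(s.stage_id)
                    frontier.append(s.stage_id)
        return found

    def on_checksum_mismatch(self, stage_id):
        for sid in self.succeeding(stage_id):
            s = self.stages[sid]
            # Record the rollback point: every attempt so far is untrusted.
            s.min_valid_attempt = s.latest_attempt + 1
            # Clean up the shuffle status so the stage must be recomputed.
            s.outputs.clear()
            # Cancel and resubmit a running stage, modeled as a new attempt.
            if s.running:
                s.latest_attempt += 1

    def on_task_success(self, stage_id, partition, attempt):
        """Register a map output unless it comes from a rolled-back attempt."""
        s = self.stages[stage_id]
        if attempt < s.min_valid_attempt:
            return False  # the old attempt may have read inconsistent data
        s.outputs[partition] = attempt
        return True


# Stage 0 feeds stage 1 (running), which feeds stage 2 (finished); stage 3 is unrelated.
sched = ToyScheduler([
    Stage(0),
    Stage(1, parents=[0], running=True, outputs={0: 0}),
    Stage(2, parents=[1], outputs={0: 0, 1: 0}),
    Stage(3, outputs={0: 0}),
])
sched.on_checksum_mismatch(0)
late_result_accepted = sched.on_task_success(1, 0, attempt=0)
new_result_accepted = sched.on_task_success(1, 0, attempt=1)
```

In this model, a late task result from attempt 0 of stage 1 is dropped after the rollback, while a result from the resubmitted attempt 1 is registered. Stage 2 is not running, so it only loses its registered outputs and would be recomputed the next time it is needed.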
Why are the changes needed?
To ensure that all succeeding stages are resubmitted and fully retried when a shuffle checksum mismatch is detected.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests added.
Was this patch authored or co-authored using generative AI tooling?
No
cc @cloud-fan @mridulm
gentle ping @mridulm in case you missed this
thanks, merging to master/4.1!