[SPARK-54556][CORE] Rollback succeeding shuffle map stages when shuffle checksum mismatch detected

Open ivoson opened this issue 4 weeks ago • 1 comments

What changes were proposed in this pull request?

Roll back succeeding shuffle map stages when a shuffle checksum mismatch is detected:

  • cancel and resubmit the stage if it is running;
  • clean up the shuffle status to ensure the stage will be resubmitted;
  • mark the rollback attemptId and ignore results from older attempts, which may have consumed inconsistent data.
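
The three steps above can be illustrated with a small toy model. This is only a sketch of the idea and not the code in this PR: `ToyStage`, `ToyScheduler`, and all of their fields and methods are hypothetical names invented for illustration, and the real scheduler logic in Spark core is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    """Hypothetical stand-in for a shuffle map stage (not a Spark class)."""
    stage_id: int
    running: bool = False
    # Id of the most recently launched attempt; -1 means none launched yet.
    latest_attempt: int = -1
    # Results from attempts with id <= rollback_attempt are ignored.
    rollback_attempt: int = -1
    # partition -> attempt id that produced the registered map output
    outputs: dict = field(default_factory=dict)


class ToyScheduler:
    """Hypothetical scheduler that models only the rollback bookkeeping."""

    def __init__(self, stages):
        self.stages = {s.stage_id: s for s in stages}
        self.submitted = []  # stage ids, in submission order

    def submit(self, stage):
        # Launch a new attempt of the stage and return its attempt id.
        stage.latest_attempt += 1
        stage.running = True
        self.submitted.append(stage.stage_id)
        return stage.latest_attempt

    def rollback(self, stage_ids):
        # Called for the succeeding stages once a checksum mismatch is seen.
        for sid in stage_ids:
            stage = self.stages[sid]
            # Step 3: every attempt launched so far may have read bad data.
            stage.rollback_attempt = stage.latest_attempt
            # Step 2: drop registered outputs so the stage counts as
            # incomplete and will be submitted again when it is needed.
            stage.outputs.clear()
            # Step 1: a running stage is cancelled and resubmitted now.
            if stage.running:
                stage.running = False
                self.submit(stage)

    def on_task_success(self, stage_id, attempt, partition):
        # Returns True if the result was accepted, False if it was ignored.
        stage = self.stages[stage_id]
        if attempt <= stage.rollback_attempt:
            return False  # late result from a rolled-back attempt
        stage.outputs[partition] = attempt
        return True


# Usage: a late result from attempt 0 is dropped after the rollback.
stage = ToyStage(stage_id=1)
scheduler = ToyScheduler([stage])
first = scheduler.submit(stage)
scheduler.on_task_success(1, first, partition=0)
scheduler.rollback([1])
accepted = scheduler.on_task_success(1, first, partition=1)
```

In this model a stage that is not running is only cleaned up and marked; it is resubmitted later, when something needs its output again, which is why clearing the shuffle status matters on its own.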

Why are the changes needed?

To ensure that all succeeding stages are resubmitted and fully retried when a shuffle checksum mismatch is detected.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests added.

Was this patch authored or co-authored using generative AI tooling?

No

ivoson avatar Dec 02 '25 01:12 ivoson

cc @cloud-fan @mridulm

ivoson avatar Dec 09 '25 08:12 ivoson

gentle ping @mridulm in case you missed this

ivoson avatar Dec 15 '25 09:12 ivoson

thanks, merging to master/4.1!

cloud-fan avatar Dec 19 '25 16:12 cloud-fan