[Bug] fix(spark): Triggering Stage retry requires reassigning the shuffle server in the retry Stage.
What changes were proposed in this pull request?
If the shuffle servers are not reassigned after a Stage retry is triggered, data will be lost. This PR therefore reassigns the shuffle servers when the Stage is retried. The problem shows up as the following test failure: Error: Failures: Error: RSSStageDynamicServerReWriteTest.testRSSStageResubmit:119->SparkIntegrationTestBase.run:64->SparkIntegrationTestBase.verifyTestResult:149 expected: <1000> but was: <970>.
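To illustrate the idea, here is a minimal sketch of "one assignment per stage attempt". It is not the code in this PR: the class `StageRetryAssignmentSketch` and its methods `assignmentFor` and `reportFaulty` are hypothetical names, and the selection policy (take the first healthy servers in list order) is an assumption made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the actual code base. */
class StageRetryAssignmentSketch {
    private final List<String> allServers;
    private final int serversPerShuffle;
    private final Set<String> faultyServers = new HashSet<>();
    // shuffleId -> stage attempt number that the current assignment belongs to
    private final Map<Integer, Integer> assignedAttempt = new HashMap<>();
    // shuffleId -> servers currently assigned to that shuffle
    private final Map<Integer, List<String>> assignments = new HashMap<>();

    StageRetryAssignmentSketch(List<String> allServers, int serversPerShuffle) {
        this.allServers = new ArrayList<>(allServers);
        this.serversPerShuffle = serversPerShuffle;
    }

    /** Marks a server as unusable for future assignments. */
    synchronized void reportFaulty(String server) {
        faultyServers.add(server);
    }

    /**
     * Returns the servers for the given shuffle and stage attempt.
     * A higher attempt number than the recorded one means the stage was
     * retried, so a fresh assignment is computed instead of reusing the old one.
     * The method is synchronized so that concurrent tasks of the same attempt
     * all observe the same assignment.
     */
    synchronized List<String> assignmentFor(int shuffleId, int stageAttempt) {
        Integer known = assignedAttempt.get(shuffleId);
        if (known == null || stageAttempt > known) {
            List<String> picked = new ArrayList<>();
            for (String server : allServers) {
                if (picked.size() == serversPerShuffle) {
                    break;
                }
                if (!faultyServers.contains(server)) {
                    picked.add(server);
                }
            }
            if (picked.size() < serversPerShuffle) {
                throw new IllegalStateException("not enough healthy shuffle servers");
            }
            assignments.put(shuffleId, picked);
            assignedAttempt.put(shuffleId, stageAttempt);
        }
        // Return a copy so callers cannot mutate the recorded assignment.
        return new ArrayList<>(assignments.get(shuffleId));
    }
}
```

In this sketch, tasks of the same attempt keep writing to the same servers, while the first task of a retried attempt triggers a new assignment that skips servers reported as faulty.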
Why are the changes needed?
Fix: #1844
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Test Results
2 966 files +32, 2 966 suites +32, 6h 31m 6s ⏱️ +17m 56s
1 096 tests ±0: 1 094 ✅ +1, 2 💤 ±0, 0 ❌ -1
13 735 runs +62: 13 705 ✅ +63, 30 💤 ±0, 0 ❌ -1
Results for commit e6085475. ± Comparison against base commit ac89c19b.
Codecov Report
Attention: Patch coverage is 13.37209% with 149 lines in your changes missing coverage. Please review.
Project coverage is 51.70%. Comparing base (910823d) to head (d729702). Report is 5 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #1845 +/- ##
============================================
- Coverage 52.13% 51.70% -0.43%
+ Complexity 3358 2965 -393
============================================
Files 526 477 -49
Lines 29844 22603 -7241
Branches 2560 2079 -481
============================================
- Hits 15559 11687 -3872
+ Misses 13271 10178 -3093
+ Partials 1014 738 -276
In this PR's build checks, integration / -Pspark2.3 has failures:
Error: Failures: Error: RSSStageDynamicServerReWriteTest.testRSSStageResubmit:119->SparkIntegrationTestBase.run:64->SparkIntegrationTestBase.verifyTestResult:149 expected: <1000> but was: <970>.
I cannot reproduce it locally; it may be due to this bug.
Yeah, just fixing that bug.
@jerqi @zuston @rickyma This is ready. Please check it out when you have time.
@jerqi @zuston @rickyma @xumanbu There are concurrency bugs in the code currently running online, so I would appreciate it if you could take a look at this soon.
We have not tried this feature in production; let others take a look first.
@zuston said that deleting the legacy shuffle data is not feasible because it would take too much time.
Modified the relevant code.
Thanks for your effort, let's reopen this feature together.
The code for this feature is now complete. Please take a look. @jerqi @maobaolong @zuston
Could you add a test case for this fix?
There is already an integration test for this feature: RSSStageDynamicServerReWriteTest.