incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[Bug] fix(spark): Triggering Stage retry requires reassigning the shuffle server in the retry Stage.

Open yl09099 opened this issue 1 year ago • 8 comments

What changes were proposed in this pull request?

If the Shuffle Server is not reassigned after the Retry is triggered at the Stage, data will be lost. Therefore, reassign the Shuffle Server after the Retry. question: Error: Failures: Error: RSSStageDynamicServerReWriteTest.testRSSStageResubmit:119-SparkIntegrationTestBase.run:64->SparkIntegrationTestBase.verifyTestResult:149 expected: <1000> but was: <970>.

Why are the changes needed?

Fix: #1844

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Presence test.

yl09099 avatar Jun 30 '24 15:06 yl09099

Test Results

 2 966 files  +32   2 966 suites  +32   6h 31m 6s ⏱️ + 17m 56s  1 096 tests ± 0   1 094 ✅ + 1   2 💤 ±0  0 ❌  - 1  13 735 runs  +62  13 705 ✅ +63  30 💤 ±0  0 ❌  - 1 

Results for commit e6085475. ± Comparison against base commit ac89c19b.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jun 30 '24 15:06 github-actions[bot]

Codecov Report

Attention: Patch coverage is 13.37209% with 149 lines in your changes missing coverage. Please review.

Project coverage is 51.70%. Comparing base (910823d) to head (d729702). Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
...fle/shuffle/manager/ShuffleManagerGrpcService.java 20.68% 67 Missing and 2 partials :warning:
...uniffle/shuffle/manager/RssShuffleManagerBase.java 0.00% 26 Missing :warning:
.../shuffle/BlockIdSelfManagedShuffleWriteClient.java 0.00% 12 Missing :warning:
...pache/uniffle/server/ShuffleServerGrpcService.java 0.00% 10 Missing :warning:
...ava/org/apache/uniffle/shuffle/BlockIdManager.java 0.00% 9 Missing :warning:
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 0.00% 9 Missing :warning:
...nt/request/RssPartitionToShuffleServerRequest.java 0.00% 8 Missing :warning:
...pache/spark/shuffle/writer/WriteBufferManager.java 66.66% 2 Missing :warning:
...t/request/RssReportShuffleWriteFailureRequest.java 0.00% 2 Missing :warning:
.../apache/uniffle/client/api/ShuffleWriteClient.java 0.00% 1 Missing :warning:
... and 1 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1845      +/-   ##
============================================
- Coverage     52.13%   51.70%   -0.43%     
+ Complexity     3358     2965     -393     
============================================
  Files           526      477      -49     
  Lines         29844    22603    -7241     
  Branches       2560     2079     -481     
============================================
- Hits          15559    11687    -3872     
+ Misses        13271    10178    -3093     
+ Partials       1014      738     -276     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jun 30 '24 16:06 codecov-commenter

In this PR build check. integration / -Pspark2.3 has Failures.

Error: Failures: Error: RSSStageDynamicServerReWriteTest.testRSSStageResubmit:119->SparkIntegrationTestBase.run:64->SparkIntegrationTestBase.verifyTestResult:149 expected: <1000> but was: <970>.

but I cannot reappear locally; it may be due to this bug.

xumanbu avatar Jul 04 '24 12:07 xumanbu

In this PR build check. integration / -Pspark2.3 has Failures.

Error: Failures: Error: RSSStageDynamicServerReWriteTest.testRSSStageResubmit:119->SparkIntegrationTestBase.run:64->SparkIntegrationTestBase.verifyTestResult:149 expected: <1000> but was: <970>.

but I cannot reappear locally; it may be due to this bug.

Yeah, just fixing that bug.

yl09099 avatar Jul 05 '24 02:07 yl09099

@jerqi @zuston @rickyma This is ready. Check it out when you have time.Please!

yl09099 avatar Jul 05 '24 09:07 yl09099

@jerqi @zuston @rickyma @xumanbu There are concurrent bugs in the code on the line, so I want you to take a look at this soon.

yl09099 avatar Jul 09 '24 06:07 yl09099

We have not tried this feature in prod, let others take a look first.

rickyma avatar Jul 09 '24 07:07 rickyma

@zuston said that it's not available to delete the legacy shuffle data. It will cost too much time.

jerqi avatar Aug 05 '24 02:08 jerqi

Modify relevant code.

yl09099 avatar Nov 26 '24 07:11 yl09099

Modify relevant code.

Thanks for your effort, let's reopen this feature together.

zuston avatar Nov 27 '24 05:11 zuston

The relevant code of this function has been completed, I hope you have a look~ @jerqi @maobaolong @zuston

yl09099 avatar Nov 27 '24 14:11 yl09099

Could you add a test case for this fix?

jerqi avatar Nov 28 '24 08:11 jerqi

Could you add a test case for this fix?

There has been an integration test for this feature:RSSStageDynamicServerReWriteTest.

yl09099 avatar Nov 29 '24 05:11 yl09099